Computational method for designing enzymes for incorporation of non natural amino acids into proteins

ABSTRACT

The instant invention provides methods, reagents, and computational tools for designing non-natural substrate analogs for enzymes, especially for designing unnatural amino acid analogs for aminoacyl tRNA Synthetases (AARSs), such as the Phe tRNA Synthetase. The instant invention also provides methods to incorporate unnatural amino acid analogs, especially those with interesting functional groups, into protein products to generate proteins of modified or novel functions.

REFERENCE TO RELATED APPLICATION

This application claims the benefit under 35 U.S.C. 119(e) of U.S.Provisional Application No. 60/360,146, filed on Feb. 27, 2002, theentire content of which is incorporated herein by reference.

GOVERNMENT SUPPORT

Part of this work was supported by NIH Grants R01-GM62523 andT32-GM08501 awarded to the NSF Center for the Science and Engineering ofMaterials at Caltech. The Government has certain rights in thisinvention.

BACKGROUND OF THE INVENTION

Protein engineering is a powerful tool for modification of thestructural catalytic and binding properties of natural proteins and forthe de novo design of artificial proteins. Protein engineering relies onan efficient recognition mechanism for incorporating mutant amino acidsin the desired protein sequences. Though this process has been veryuseful for designing new macromolecules with precise control ofcomposition and architecture, a major limitation is that the mutagenesisis restricted to the 20 naturally occurring amino acids. However, it isbecoming increasingly clear that incorporation of non-natural aminoacids can extend the scope and impact of protein engineering methods.¹Thus, for many applications of designed macromolecules, it would bedesirable to develop methods for incorporating amino acids that havenovel chemical functionality not possessed by the 20 amino acidscommonly found in naturally occuring proteins. That is, ideally, onewould like to tailor changes in a protein (the size, acidity,nucleophilicity, hydrogen-bonding or hydrophobic properties, etc. ofamino acids) to fulfill a specific structural or functional property ofinterest. The ability to incorporate such amino acid analogs intoproteins would greatly expand our ability to rationally andsystematically manipulate the structures of proteins, both to probeprotein function and create proteins with new properties. For example,the ability to synthesize large quantities of proteins containing heavyatoms would facilitate protein structure determination, and the abilityto site specifically substitute fluorophores or photo-cleavable groupsinto proteins in living cells would provide powerful tools for studyingprotein functions in vivo. One might also be able to enhance theproperties of proteins by providing building blocks with new functionalgroups, such as an amino acid containing a keto-group.

Incorporation of novel amino acids in macromolecules has been successfulto an extent. Biosynthetic assimilation of non-canonical amino acidsinto proteins has been achieved largely by exploiting the capacity ofthe wild type synthesis apparatus to utilize analogs of naturallyoccurring amino acids (Budisa 1995, Eur. J. Biochem 230: 788–796; Deming1997, J. Macromol. Sci. Pure Appl. Chem A34; 2143–2150; Duewel 1997,Biochemistry 36: 3404–3416; van Hest and Tirrell 1998, FEBS Lett428(1–2): 68–70; Sharma et al., 2000, FEBS Lett 467(1): 37–40).Nevertheless, the number of amino acids shown conclusively to exhibittranslational activity in vivo is small, and the chemical functionalitythat has been accessed by this method remains modest. In designingmacromolecules with desired properties, this poses a limitation sincesuch designs may require incorporation of complex analogs that differsignificantly from the natural substrates in terms of both size andchemical properties and hence, are unable to circumvent the specificityof the synthetases. Thus, to further expand the range of non-naturalamino acids that can be incorporated, there is a need to manipulate theactivity of the aminoacyl tRNA synthetases (AARS).

Protein translation from an mRNA template is carried out by ribosomes.During the translation process, each tRNA is matched with its amino acidlong before it reaches the ribosome. The match is made by a collectionof enzymes known as the aminoacyl-tRNA synthetases (AARS). These enzymescharge each tRNA with the proper amino acid, thus allowing each tRNA tomake the proper translation from the genetic code of DNA (and the mRNAtranscribed from the DNA) into the amino acid code of proteins.

Most cells make twenty different aminoacyl-tRNA synthetases, one foreach type of amino acid. These twenty enzymes are each optimized forfunction with its own particular amino acid and the set of tRNAmolecules appropriate to that amino acid. Aminoacyl-tRNA synthetasesmust perform their tasks with high accuracy. Many of these enzymesrecognize their tRNA molecules using the anticodon. These enzymes makeabout one mistake in 10,000. For most amino acids, this level ofaccuracy is not too difficult to achieve, since most of the amino acidsare quite different from one another.

SUMMARY OF THE INVENTION

The instant invention provides methods and computational tools forsystematically modifying the substrate specificity of AminoAcyl tRNASynthetases (AARSs) to enable them to utilize amino acid analogs incell-free and whole cell translation systems. Specifically, the instantinvention provides methods and tools for redesigning the substratebinding site in the synthetases to facilitate the accommodation of aminoacid analogs.

Thus one aspect of the invention provides a method for generating amutant aminoacyl tRNA synthetase (AARS), comprising: (i) providing thecoordinates for a plurality of different rotamers of an amino acidanalog resulting from varying torsional angles; (ii) providing a set ofstructure coordinates for amino acid residues that define a bindingpocket for an aminoacyl tRNA synthetase; (iii) modeling interactions ofsaid rotamers with said binding pocket and identifying non-bondinteractions between said residues of said binding pocket and saidrotamers; (iv) altering one or more amino acid residues in said bindingpocket to produce one or more sets of structure coordinates that definealtered binding pockets for mutants of said aminoacyl tRNA synthetases;(v) modeling interactions of said rotamers with said one or more alteredbinding pockets; and, (vi) generating a set of optimized aminoacyl tRNAsynthetases sequences having favorable interactions with said amino acidanalog.

In one embodiment, the step generating a set of optimized aminoacyl tRNAsynthetases sequences includes a Dead-End Elimination (DEE) computationto remove rotamers from said rotamer library and use in said modelingsteps.

In one embodiment, the amino acid analog has a non-naturally occurringsidechain.

In one embodiment, the amino acid analog is a D-enantiomer.

Another aspect of the invention provides a method for synthesizing apeptide or protein incorporating one or more amino acid analogs,comprising providing a translational system including: (i) a transcript,or means for generating a transcript, that encodes a peptide orpolypeptide, and (ii) one or more mutant AARS having a mutated bindingpocket, each of which AARS catalyze incorporation of an amino acidanalog into said peptide or protein under the conditions of thetranslation system.

In one embodiment, said translation system is a whole cell thatexpresses said one or more AARS.

In one embodiment, said translation system is a cell lysate orreconstituted protein preparation that is translation competent.

In one embodiment, at least one of said mutant AARS catalyzesincorporation of said amino acid analog with a K_(cat) at least 10 foldgreater than the wild-type AARS.

In one embodiment, at least one of said mutant AARS has a sequenceidentified by the method of claim 1.

Any of these embodiments may be combined as appropriate.

Another aspect of the invention provides a peptide or polypeptideincorporating one or more amino acid analogs, which peptide orpolypeptide was produced by any of the above-describes methods orcombinations thereof.

Another aspect of the invention provides a method for conducting abiotechnology business comprising: (i) identifying a mutant AARSsequences, by the method of claim 1, which binds to a amino acid analog;(ii) providing a translation system including: (a) a transcript, ormeans for generating a transcript, that encodes a peptide orpolypeptide, (b) an AARS having said identified mutant AARS sequence,and (c) said amino acid analog, under circumstances wherein said AARScatalyzes incorporation of said amino acid analog in said peptide orpolypeptide.

In one embodiment, the method further comprises the step of providing apackaged pharmaceutical including the peptide or polypeptide, andinstructions and/or a label describing how to administer said peptide orpolypeptide.

Another aspect of the invention provides a method for generating amutant aminoacyl tRNA synthetase (AARS), comprising: (i) providing a setof structure coordinates for amino acid residues that define a bindingpocket for a mutant aminoacyl tRNA synthetase, which binding pocketvaries by at least one residue from the sequence of the wild-type formof said aminoacyl tRNA synthetase; (ii) providing a rotamer library of aplurality of amino acid analogs which represents the coordinates for aplurality of different conformations of each of said analogs resultingfrom varying torsional angles; (iii) modeling interactions of saidrotamers with said binding pocket and identifying non-bond interactionsbetween residues of the binding pocket and the amino acid analogs; (iv)identifying amino acid analogs that have favorable interactions withsaid mutant aminoacyl tRNA synthetase.

Another aspect of the invention provides a method for designing aminoacid sequence change(s) in an aminoacyl tRNA synthetase (AARS), whereinsaid change would enable said AARS to incorporate an amino acid analogof a natural amino acid substrate of said AARS into a protein in vivo,comprising: (a) establishing a three-dimentional structure model of saidAARS with said natural amino acid substrate; (b) establishing a rotamerlibrary for said amino acid analog; (c) identifying, based on saidthree-dimentional structure model, anchor residues of said AARSinteracting with the backbone of said natural amino acid substrate; (d)identifying, in the binding pocket of said AARS, amino acid position(s)involved in interaction with the side chain of said natural amino acidsubstrate as candidate variable residue position(s); (e) establishing anAARS-analog backbone structure by substituting said natural amino acidsubstrate with said amino acid analog in said three-dimentionalstructure model, wherein the geometric orientation of the backbone ofsaid amino acid analog is specified by the orientation of the backboneof said natural amino acid substrate, and wherein the backbone of saidamino acid analog and all anchor residues identified in (c) are fixed inidentity and rotameric conformation in relation to said AARS-analogbackbone structure; (f) establishing a group of potential rotamers foreach of the candidate variable residue position(s) identified in (d),wherein the group of potential rotamers for at least one of saidvariable residue residue position(s) has a rotamer selected from each ofat least two diferent amino acid side chains; and (g) analyzing theinteraction of each of the rotamers for said amino acid analog androtamers for said variable residue position(s) with all or part of theremainder of said AARS-analog backbone structure to generate a set ofoptimized protein sequences, wherein said analyzing step includes aDead-End Elimination (DEE) canmputation; wherein steps (b)–(d) arecarried out in any order.

In one embodiment, the method further comprises (h) identifyingadditional protein sequence changes that favor interaction between saidAARS and said amino acid analog, by repeating step (g) while scaling upinteractions between said amino acid analog and said variable residueposition(s).

In one embodiment, the method further comprises testing said AARS invivo for its ability, specificity, and/or efficiency for incorporatingsaid amino acid analog into a protein.

In one embodiment, said three-dimentional structure model is a knowncrystallographic or NMR structure.

In one embodiment, said three-dimentional structure model is establishedby homology modeling based on a known crystallographic or NMR structureof a homolog of said AARS.

In one embodiment, said homolog is at least about 70% identical to saidAARS in the active site region.

In one embodiment, the rotamer library for said amino acid analog is abackbone-independent rotamer library.

In one embodiment, the rotamer library for said amino acid analog is arotamer library established by varying both the χ1 and χ2 torsionalangles by ±20 degrees, in increment of 5 degrees, from the values ofsaid natural amino acid susbtrate in said AARS structure.

In one embodiment, said AARS is Phe tRNA Synthetase (PheRS). Preferably,said PheRS is from E. coli.

In one embodiment, said AARS is from a eukaryote, such as human.

In one embodiment, said amino acid analog is a derivative of at leastone of the 20 natural amino acids, with one or more functional groupsnot present in natural amino acids. For example, said functional groupcan be selected from the group consisting of: bromo-, iodo-, ethynyl-,cyano-, azido-, aceytyl, aryl ketone, a photolabile group, a fluoresentgroup, and a heavy metal. In a preferred embodiment, said amino acidanalog is a derivative of Phe.

In one embodiment, said set of optimized protein sequences comprises theglobally optimal protein sequence.

In one embodiment, said DEE computation is selected from the groupconsisting of original DEE and Goldstein DEE.

In one embodiment, said analyzing step includes the use of at least onescoring function. For example, said scoring function can be selectedfrom the group consisting of a van der Waals potential scoring function,a hydrogen bond potential scoring function, an atomic solvation scoringfunction, an electrostatic scoring function and a secondary structurepropensity scoring function. Also, said analyzing step includes the useof at least two scoring functions. In a preferred embodiment, saidanalyzing step includes the use of at least three scoring functions. Ina preferred embodiment, said analyzing step includes the use of at leastfour scoring functions. In a preferred embodiment, said atomic salvationscoring function includes a scaling factor that compensates forover-counting.

In one embodiment, the method further comprises generating a rankordered list of additional optimal sequences from said globally optimalprotein sequence. In a preferred embodiment, said generating includesthe use of a Monte Carlo search. In a preferred embodiment, the methodfurther comprises testing some or all of said protein sequences fromsaid ordered list to produce potential energy test results. In apreferred embodiment, the method further comprises analyzing thecorrespondence between said potential energy test results andtheoretical potential energy data.

In one embodiment, said amino acid analog can be buried in the bindingpocket of at least one of said set of optimized protein sequences, orwherein said amino acid analog is completely or almost completelysuperimposable with the natural amino acid substrate in saidthree-dimentional structure.

Any of the above embodiments can be combined as appropriate.

Another aspect of the invention provides a recombinant AARS proteingenerated by any of the above suitable methods or combinations thereof,said AARS protein comprising an optimized protein sequence thatincorporates an amino acid analog of a natural amino acid substrate ofsaid AARS into a protein in vivo.

Another aspect of the invention provides a nucleic acid sequenceencoding a recombinant AARS protein that can be generated by any of theabove suitable methods or combinations thereof.

Another aspect of the invention provides an expression vector comprisingthe nucleic acid sequence described above.

Another aspect of the invention provides a host cell comprising thenucleic acid sequence described above.

Another aspect of the invention provides an apparatus for generating amutant aminoacyl tRNA synthetase (AARS), said apparatus comprising: (i)means for providing the coordinates for a plurality of differentrotamers of an amino acid analog resulting from varying torsionalangles; (ii) means for providing a set of structure coordinates foramino acid residues that define a binding pocket for an aminoacyl tRNAsynthetase; (iii) means for modeling interactions of said rotamers withsaid binding pocket and identifying non-bond interactions between saidresidues of said binding pocket and said rotamers; (iv) means foraltering one or more amino acid residues in said binding pocket toproduce one or more sets of structure coordinates that define alteredbinding pockets for mutants of said aminoacyl tRNA synthetases; (v)means for modeling interactions of said rotamers with said one or morealtered binding pockets; and, (vi) means for generating a set ofoptimized aminoacyl tRNA synthetases sequences having favorableinteractions with said amino acid analog.

Another aspect of the invention provides a computer system for use ingenerating a mutant aminoacyl tRNA synthetase (AARS), said computersystem comprising computer instructions for: (i) providing thecoordinates for a plurality of different rotamers of an amino acidanalog resulting from varying torsional angles; (ii) providing a set ofstructure coordinates for amino acid residues that define a bindingpocket for an aminoacyl tRNA synthetase; (iii) modeling interactions ofsaid rotamers with said binding pocket and identifying non-bondinteractions between said residues of said binding pocket and saidrotamers; (iv) altering one or more amino acid residues in said bindingpocket to produce one or more sets of structure coordinates that definealtered binding pockets for mutants of said aminoacyl tRNA synthetases;(v) modeling interactions of said rotamers with said one or more alteredbinding pockets; and, (vi) generating a set of optimized aminoacyl tRNAsynthetases sequences having favorable interactions with said amino acidanalog.

Another aspect of the invention provides a computer-readable mediumstoring a computer program executable by a plurality of servercomputers, the computer program comprising computer instructions for:(i) providing the coordinates for a plurality of different rotamers ofan amino acid analog resulting from varying torsional angles; (ii)providing a set of structure coordinates for amino acid residues thatdefine a binding pocket for an aminoacyl tRNA synthetase; (iii) modelinginteractions of said rotamers with said binding pocket and identifyingnon-bond interactions between said residues of said binding pocket andsaid rotamers; (iv) altering one or more amino acid residues in saidbinding pocket to produce one or more sets of structure coordinates thatdefine altered binding pockets for mutants of said aminoacyl tRNAsynthetases; (v) modeling interactions of said rotamers with said one ormore altered binding pockets; and, (vi) generating a set of optimizedaminoacyl tRNA synthetases sequences having favorable interactions withsaid amino acid analog.

Another aspect of the invention provides a computer data signal embodiedin a carrier wave, comprising computer instructions for: (i) providingthe coordinates for a plurality of different rotamers of an amino acidanalog resulting from varying torsional angles; (ii) providing a set ofstructure coordinates for amino acid residues that define a bindingpocket for an aminoacyl tRNA synthetase; (iii) modeling interactions ofsaid rotamers with said binding pocket and identifying non-bondinteractions between said residues of said binding pocket and saidrotamers; (iv) altering one or more amino acid residues in said bindingpocket to produce one or more sets of structure coordinates that definealtered binding pockets for mutants of said aminoacyl tRNA synthetases;(v) modeling interactions of said rotamers with said one or more alteredbinding pockets; and, (vi) generating a set of optimized aminoacyl tRNAsynthetases sequences having favorable interactions with said amino acidanalog.

Another aspect of the invention provides an apparatus comprising acomputer readable storage medium having instructions stored thereon for:(i) accessing a datafile representative of the coordinates for aplurality of different rotamers of an amino acid analog resulting fromvarying torsional angles; (ii) accessing a datafile representative of aset of structure coordinates for amino acid residues that define abinding pocket for an aminoacyl tRNA synthetase; (iii) a set of modelingroutines for (a) calculating interactions of said rotamers with saidbinding pocket and identifying non-bond interactions between residues ofthe binding pocket and the amino acid analog; (b) altering one or moreamino acid residues in the binding pocket for the aminoacyl tRNAsynthetase to produce one or more sets of structure coordinates thatdefine altered binding pockets for mutants of said aminoacyl tRNAsynthetases; (c) calculating interactions of said rotamers with said oneor more altered binding pockets; and (d) generating a listrepresentative of optimized aminoacyl tRNA synthetases sequences havingfavorable interactions with said amino acid analog.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 Structure of phenylalanine (1) and p-acetylphenylalanine (2).

FIG. 2 SDS-PAGE of cell lysates of 4 hr post-induction with 1 mM IPTG.Expression plasmids and amino acid supplements are indicated.Concentration of 1 (Phe)=20 mg/L; 2 (p-AcetylPhe)=250 mg/L. Lane M:molecular weight marker (35.5, 31, 21.5, and 14.4 kDa).

FIG. 3 MALDI-TOF mass spectra of tryptic peptides digested from mDHFRexpressed in media supplemented with 1(a) or 2(b). Peptides A and B arerepresented by SEQ ID NOs: 1 and 2, respectively.

FIG. 4 Western blot showing chemoselective modification of ketonefunctionality in mDHFR. (a) Modified protein was treated with biotinhydrazide (BH), stained with HRP conjugated streptavidin and analyzed bywestern blot. (b) Western blot analysis of the products. Lane 1:mDHFR-wt+buffer; Lane 2: mDHFR-2+buffer, lane 3: mDHFR-wt+BH; Lane 4:mDHFR-2+BH.

FIG. 5 MALDI-MS (Matrix-Assisted Laser Desorption Ionization MassSpectrometry) of purified proteins.

FIG. 6 Active sites in tPheRS. (a) wild type; (b) A314G mutant; (c)T261G/A314G double mutant.

FIG. 7 The binding pocket for Phe in PheRS. Phenylalanine zwitterion isrepresented in purple. The anchor residues are labeled in bold.

FIG. 8 Analogs tested for incorporation in in vivo assays. FPA and SPAare the only two analogs that get incorporated by the natural PheRS.

DETAILED DESCRIPTION OF THE INVENTION

I. Overview

The instant invention provides methods and computational tools formodifying the substrate specificity of an AminoAcyl tRNA Synthetases(AARSs) through mutation to enable the enzyme to more efficientlyutilize amino acid analog(s) in protein translation systems, either invitro or in whole cells. A salient feature to the instant invention aremethods and tools for systematically redesigning the substrate bindingsite of an AARS enzyme to facilitate the use of of unnatural substratesin the peptide or protein translation reaction the enzyme catalyzes.

In one embodiment, the subject method uses a rotamer library for theintended amino acid analog (or “analog”), e.g., which represents thecoordinates for a plurality of different conformations of the analogresulting from varying torsional angles in the molecule. Members of therotamer library are modeled in a docking simulation with a model of anAARS, e.g., which includes the coordinates for the substrate bindingsite of the enzyme. Amino acid residues (or “residues”) in the bindingpocket of the enzyme that interact with the amino acid analogunfavorably are varied in models of mutant AARS enzymes, the dockingsimulations are repeated, and mutant forms of the AARS enzyme whichinteract favorably with the amino acid analog are identified.

In certain embodiments, the computational method is developed to enhancethe interactions between the analog and the protein positions. This isdone by scaling up the pair-wise energies between the analog and theamino acids allowed at the design positions on the protein in the energycalculations. In an optimization calculation where the AARS-analoginteractions are scaled up compared to the intra-protein interactions,sequence selection is biased toward selecting amino acid residues ofAARS to be those that have favorable interaction with the analog.

In other embodiments, rather than holding the identity of the amino acidanalog constant and varying the AARS structure (by modeling severaldifferent mutant structures), the subject method is carried out usingthe molecular model(s) for a single AARS mutant and sampling a varietyof different amino acid analogs or potential fragments thereof, toidentify analogs which are likely to interact with, and be substratesfor the mutant AARS enzyme. This approach can make use of coordinatelibraries for amino acid analogs (including rotamer variants) orlibraries of functional groups and spacers that can be joined to formthe sidechain of an amino acid analog.

Yet another aspect of the invention provides for use of a mutant AARSidentified by one of the subject computational methods in the synthesisof a peptide or protein. Such mutant AARS enzymes can be used incell-free translation systems, such as extracts or purified proteintranslation systems, or recombinantly expressed in whole cells,including in cells from a different organism. For example, AARS from onebacteria species (or a eukaryotic species such as yeast) can berecombinantly expressed in a different bacteria (or a differenteukaryotic species such as insect, mammal or plant).

Another aspect of the invention provides peptides and polypeptideproduced by a translation system using one or more of the subject AARSmutants, which peptides and polypeptides have one or more non-naturallyoccurring amino acid residues.

II. Definitions

The terms used in this specification generally have their ordinarymeanings in the art, within the context of this invention and in thespecific context where each term is used. Certain terms are discussedbelow or elsewhere in the specification, to provide additional guidanceto the practitioner in describing the compositions and methods of theinvention and how to make and use them. The scope an meaning of any useof a term will be apparent from the specific context in which the termis used.

“About” and “approximately” shall generally mean an acceptable degree oferror for the quantity measured given the nature or precision of themeasurements. Typical, exemplary degrees of error are within 20 percent(%), preferably within 10%, and more preferably within 5% of a givenvalue or range of values. Alternatively, and particularly in biologicalsystems, the terms “about” and “approximately” may mean values that arewithin an order of magnitude, preferably within 5-fold and morepreferably within 2-fold of a given value. Numerical quantities givenherein are approximate unless stated otherwise, meaning that the term“about” or “approximately” can be inferred when not expressly stated.

“Amino acid analog” is meant to include all amino acid-like compoundsthat are similar in structure and/or overall shape to one or more of thetwenty L-amino acids commonly found in naturally occuring proteins (Alaor A, Cys or C, Asp or D, Glu or E, Phe or F, Gly or G, His or H, Ile orI, Lys or K, Leu or L, Met or M, Asn or N, Pro or P, Gln or Q, Arg or R,Ser or S, Thr or T, Val or V, Trp or W, Tyr or Y, as defined and listedin WIPO Standard ST.25 (1998), Appendix 2, Table 3). Amino acid analogcan also be natural amino acids with modified side chains or backbones.Preferably, these analogs usually are not “substrates” for the AARSsbecause of the high specificity of the AARSs, although occasionally,certain analogs with structures/shapes sufficiently close to naturalamino acids may be erroneously incorporated into the protein by AARSs,especially mutant AARSs with relaxed substrate specificity. In apreferred embodiment, the analogs share backbone structures, and/or eventhe most side chain structures of one or more natural amino acids, withthe only difference(s) being conatining one or more modified groups inthe molecule. Such modification may include, without limitation,substitution of an atom (such as N) for a related atom (such as S),addition of a group (such as methyl, or hydroxyl group, etc.) or an atom(such as Cl or Br, etc.), deletion of a group (supra), substitution of acovalent bond (single bond for double bond, etc.), or combinationsthereof. Amino acid analogs may include α-hydroxy acids, and β-aminoacids, and can also be referred to as “modified amino acids,” “unnaturalAARS substrates” or “non-canonical amino acids.”

The amino acid analogs may either be naturally occuring or non-naturallyoccuring; as will be appreciated by those in the art, any structure forwhich a set of rotamers is known or can be generated can be used as anamino acid analog. The side chains may be in either the (R) or the (S)configuration (or D- or L-configuration). In a preferred embodiment, theamino acids are in the (S) or L-configuration.

Preferably, the overall shape and size of the amino acid analogs aresuch that, upon being charged to tRNAs by the re-designed AARS, theanalog-tRNA is a ribosomally accepted complex, that is, the tRNA-analogcomplex can be accepted by the prokaryotic or eukaryotic ribosomes in anin vivo or in vitro translation system.

“Achor residues” are residue positions in AARS that maintain criticalinteractions between the AARS and the natural amino acid backbone.

“Backbone,” or “template” includes the backbone atoms and any fixed sidechains (such as the anchor residue side chains) of the protein (e.g.,AARS). For calculation purposes, the backbone of an analog is treated aspart of the AARS backbone.

“Protein backbone structure” or grammatical equivalents herein is meantthe three dimensional coordinates that define the three dimensionalstructure of a particular protein. The structures which comprise aprotein backbone structure (of a naturally occuring protein) are thenitrogen, the carbonyl carbon, the α-carbon, and the carbonyl oxygen,along with the direction of the vector from the α-carbon to theβ-carbon.

The protein backbone structure which is input into the computer caneither include the coordinates for both the backbone and the amino acidside chains, or just the backbone, i.e. with the coordinates for theamino acid side chains removed. If the former is done, the side chainatoms of each amino acid of the protein structure may be “stripped” orremoved from the structure of a protein, as is known in the art, leavingonly the coordinates for the “backbone” atoms (the nitrogen, carbonylcarbon and oxygen, and the α-carbon, and the hydrogens attached to thenitrogen and α-carbon).

Optionally, the protein backbone structure may be altered prior to theanalysis outlined below. In this embodiment, the representation of thestarting protein backbone structure is reduced to a description of thespatial arrangement of its secondary structural elements. The relativepositions of the secondary structural elements are defined by a set ofparameters called supersecondary structure parameters. These parametersare assigned values that can be systematically or randomly varied toalter the arrangement of the secondary structure elements to introduceexplicit backbone flexibility. The atomic coordinates of the backboneare then changed to reflect the altered supersecondary structuralparameters, and these new coordinates are input into the system for usein the subsequent protein design automation. For details, see U.S. Pat.No. 6,269,312, the entire content incorporated herein by reference.

“Conformational energy” refers generally to the energy associated with aparticular “conformation”, or three-dimensional structure, of amacromolecule, such as the energy associated with the conformation of aparticular protein. Interactions that tend to stabilize a protein haveenergies that are represented as negative energy values, whereasinteractions that destabilize a protein have positive energy values.Thus, the conformational energy for any stable protein is quantitativelyrepresented by a negative conformational energy value. Generally, theconformational energy for a particular protein will be related to thatprotein's stability. In particular, molecules that have a lower (i.e.,more negative) conformational energy are typically more stable, e.g., athigher temperatures (i.e., they have greater “thermal stability”).Accordingly, the conformational energy of a protein may also be referredto as the “stabilization energy.”

Typically, the conformational energy is calculated using an energy“force-field” that calculates or estimates the energy contribution fromvarious interactions which depend upon the conformation of a molecule.The force-field is comprised of terms that include the conformationalenergy of the alpha-carbon backbone, side chain—backbone interactions,and side chain—side chain interactions. Typically, interactions with thebackbone or side chain include terms for bond rotation, bond torsion,and bond length. The backbone-side chain and side chain-side chaininteractions include van der Waals interactions, hydrogen-bonding,electrostatics and solvation terms. Electrostatic interactions mayinclude coulombic interactions, dipole interactions and quadrapoleinteractions). Other similar terms may also be included. Force-fieldsthat may be used to determine the conformational energy for a polymerare well known in the art and include the CHARMM (see, Brooks et al, J.Comp. Chem. 1983, 4:187–217; MacKerell et al., in The Encyclopedia ofComputational Chemistry, Vol. 1:271–277, John Wiley & Sons, Chichester,1998), AMBER (see, Cornell et al., J. Amer. Chem. Soc. 1995, 117:5179;Woods et al., J. Phys. Chem. 1995, 99:3832–3846; Weiner et al., J. Comp.Chem. 1986, 7:230; and Weiner et al., J. Amer. Chem. Soc. 1984, 106:765)and DREIDING (Mayo et al., J. Phys. Chem. 1990, 94-:8897) force-fields,to name but a few.

In a preferred implementation, the hydrogen bonding and electrostaticsterms are as described in Dahiyat & Mayo, Science 1997 278:82). Theforce field can also be described to include atomic conformational terms(bond angles, bond lengths, torsions), as in other references. See e.g.,Nielsen J E, Andersen K V, Honig B, Hooft R W W, Klebe G, Vriend G, &Wade R C, “Improving macromolecular electrostatics calculations,”Protein Engineering, 12: 657662(1999); Stikoff D, Lockhart D J, Sharp KA & Honig B, “Calculation of electrostatic effects at the amino-terminusof an alpha-helix,” Biophys. J., 67: 2251–2260 (1994); Hendscb Z S,Tidor B, “Do salt bridges stabilize proteins—a continuum electrostaticanalysis,” Protein Science, 3: 211–226 (1994); Schneider J P, Lear J D,DeGrado W F, “A designed buried salt bridge in a heterodimeric coil,” J.Am. Chem. Soc., 119: 5742–5743 (1997); Sidelar C V, Hendsch Z S, TidorB, “Effects of salt bridges on protein structure and design,” ProteinScience, 7: 1898–1914 (1998). Solvation terms could also be included.See e.g., Jackson S E, Moracci M, elMastry N, Johnson C M, Fersht A R,“Effect of Cavity-Creating Mutations in the Hydrophobic Core ofChymotrypsin Inhibitor 2,” Biochemistry, 32: 11259–11269 (1993);Eisenberg, D & McLachlan A D, “Solvation Energy in Protein Folding andBinding,” Nature, 319: 199–203 (1986); Street A G & Mayo S L, “PairwiseCalculation of Protein Solvent-Accessible Surface Areas,” Folding &Design, 3: 253–258 (1998); Eisenberg D & Wesson L, “Atomic salvationparameters applied to molecular dynamics of proteins in solution,”Protein Science, 1: 227–235 (1992); Gordon & Mayo, supra.

“Coupled residues” are residues in a molecule that interact, through anymechanism. The interaction between the two residues is thereforereferred to as a “coupling interaction.” Coupled residues generallycontribute to polymer fitness through the coupling interaction.Typically, the coupling interaction is a physical or chemicalinteraction, such as an electrostatic interaction, a van der Waalsinteraction, a hydrogen bonding interaction, or a combination thereof.As a result of the coupling interaction, changing the identity of eitherresidue will affect the “fitness” of the molecule, particularly if thechange disrupts the coupling interaction between the two residues.Coupling interaction may also be described by a distance parameterbetween residues in a molecule. If the residues are within a certaincutoff distance, they are considered interacting.

“Fitness” is used to denote the level or degree to which a particularproperty or a particular combination of properties for a molecule, e.g.,a protein, are optimized. In certain embodiments of the invention, thefitness of a protein is preferably determined by properties which a userwishes to improve. Thus, for example, the fitness of a protein may referto the protein's thermal stability, catalytic activity, bindingaffinity, solubility (e.g., in aqueous or organic solvent), and thelike. Other examples of fitness properties include enantioselectivity,activity towards non-natural substrates, and alternative catalyticmechanisms. Coupling interactions can be modeled as a way of evaluatingor predicting fitness (stability). Fitness can be determined orevaluated experimentally or theoretically, e.g. computationally.

Preferably, the fitness is quantitated so that each molecule, e.g., eachamino acid will have a particular “fitness value”. For example, thefitness of a protein may be the rate at which the protein catalyzes aparticular chemical reaction, or the protein's binding affinity for aligand. In a particularly preferred embodiment, the fitness of a proteinrefers to the conformational energy of the polymer and is calculated,e.g., using any method known in the art. See, e.g. Brooks B. R.,Bruccoleri R E, Olafson, B D, States D J, Swaminathan S & Karplus M,“CHARMM: A Program for Macromolecular Energy, Minimization, and DynamicsCalculations,” J. Comp. Chem., 4: 187–217 (1983); Mayo S L, Olafson B D& Goddard W A G, “DREIDING: A Generic Force Field for MolecularSimulations,” J. Phys. Chem., 94: 8897–8909 (1990); Pabo C O & SuchanekE G, “Computer-Aided Model-Building Strategies for Protein Design,”Biochemistry, 25: 5987–5991 (1986), Lazar G A, Desjarlais J R & Handel TM, “De Novo Design of the Hydrophobic Core of Ubiquitin,” ProteinScience, 6: 1167–1178 (1997); Lee C & Levitt M, “Accurate Prediction ofthe Stability and Activity Effects of Site Directed Mutagenesis on aProtein Core,” Nature, 352: 448–451 (1991); Colombo G & Merz K M,“Stability and Activity of Mesophilic Subtilisin E and Its ThermophilicHomolog: Insights from Molecular Dynamics Simulations,” J. Am. Chem.Soc., 121: 6895–6903 (1999); Weiner S J, Kollman P A, Case D A, Singh UC, Ghio C, Alagona G, Profeta S J, Weiner P, “A new force field formolecular mechanical simulation of nucleic acids and proteins,” J. Am.Chem. Soc., 106: 765–784 (1984). Generally, the fitness of a protein isquantitated so that the fitness value increases as the property orcombination of properties is optimized. For example, in embodimentswhere the thermal stability of a protein is to be optimized(conformational energy is preferably decreased), the fitness value maybe the negative conformationl energy; i.e., F=−E.

The “fitness contribution” of a protein residue refers to the level orextent f(i_(a)) to which the residue i_(a), having an identity a,contributes to the total fitness of the protein. Thus, for example, ifchanging or mutating a particular amino acid residue will greatlydecrease the protein's fitness, that residue is said to have a highfitness contribution to the polymer. By contrast, typically someresidues i_(a) in a protein may have a variety of possible identities awithout affecting the protein's fitness. Such residues, therefore have alow contribution to the protein fitness.

“Dead-end elimination” (DEE) is a deterministic search algorithm thatseeks to systematically eliminate bad rotamers and combinations ofrotamers until a single solution remains. For example, amino acidresidues can be modeled as rotamers that interact with a fixed backbone.The theoretical basis for DEE provides that, if the DEE searchconverges, the solution is the global minimum energy conformation (GMEC)with no uncertainty (Desmet et al., 1992).

Dead end elimination is based on the following concept. Consider tworotamers, i_(r) and i_(t), at residue i, and the set of all otherrotamer configurations {S} at all residues excluding i (of which rotamerj_(s) is a member). If the pairwise energy contributed between i_(r) andj_(s) is higher than the pairwise energy between i_(t) and j_(s) for all{S}, then rotamer i_(r) cannot exist in the global minimum energyconformation, and can be eliminated. This notion is expressedmathematically by the inequality.

$\begin{matrix}{{{E\left( i_{r} \right)} + {\sum\limits_{j \neq i}^{N}\;{E\left( {i_{r},j_{s}} \right)}}} > {{E\left( i_{t} \right)} + {\sum\limits_{j \neq i}^{N}{{E\left( {i_{t},j_{s}} \right)}\left\{ S \right\}}}}} & \text{(Equation~~A)}\end{matrix}$

If this expression is true, the single rotamer i_(r) can be eliminated(Desmet et al., 1992).

In this form, Equation A is not computationally tractable because, tomake an elimination, it is required that the entire sequence (rotamer)space be enumerated. To simplify the problem, bounds implied by EquationA can be utilized:

$\begin{matrix}{{{E\left( i_{r} \right)} + {\sum\limits_{j \neq i}^{N}\;{{\min(s)}{E\left( {i_{r},j_{s}} \right)}}}} > {{E\left( i_{t} \right)} + {\sum\limits_{j \neq i}^{N}{{\max(s)}{E\left( {i_{t},j_{s}} \right)}\left\{ S \right\}}}}} & \text{(Equation~~B)}\end{matrix}$

Using an analogous argument, Equation B can be extended to theelimination of pairs of rotamers inconsistent with the GMEC. This isdone by determining that a pair of rotamers i_(r) at residue i and j_(s)at residue j, always contribute higher energies than rotamers i_(u) andj_(v) with all possible rotamer combinations {L}. Similar to Equation B,the strict bound of this statement is given by:

$\begin{matrix}{{{ɛ\left( {i_{r},j_{s}} \right)} + {\sum\limits_{{k \neq i},j}^{N}\;{{\min(t)}{ɛ\left( {i_{r},j_{s},k_{t}} \right)}}}} > {{ɛ\left( {i_{u},j_{v}} \right)} + {\sum\limits_{{k \neq i},j}^{N}{{\max(t)}{ɛ\left( {i_{u},j_{v},k_{i}} \right)}}}}} & \text{(Equation~~C)}\end{matrix}$

-   -   where ε is the combined energies for rotamer pairs        ε(i _(r) ,j _(s))=E(i _(r))+E(j _(s))+E(i _(r) ,j        _(s)  (Equation D),        and        ε(i _(r) ,j _(s) ,k _(t))=E(i _(r) ,k _(t))+E(j _(s) ,k        _(t)  (Equation E).

This leads to the doubles elimination of the pair of rotamers i_(r) andj_(s), but does not eliminate the individual rotamers completely aseither could exist independently in the GMEC. The doubles eliminationstep reduces the number of possible pairs (reduces S) that need to beevaluated in the right-hand side of Equation 6, allowing more rotamersto be individually eliminated.

The singles and doubles criteria presented by Desmet et al. fail todiscover special conditions that lead to the determination of moredead-ending rotamers For instance, it is possible that the energycontribution of rotamer i_(t) is always lower than i_(r) without themaximum of i_(t) being below the minimum of i_(r). To address thisproblem, Goldstein 1994 presented a modification of the criteria thatdetermines if the energy profiles of two rotamers cross. If they do not,the higher energy rotamer can be determined to be dead-ending. Thedoubles calculation significantly more computational time than thesingles calculation. To accelerate the process, other computationalmethods have been developed to predict the doubles calculations thatwill be the most productive (Gordon & Mayo, 1998). These kinds ofmodifications, collectively referred to as fast doubles, significantlyimproved the speed and effectiveness of DEE.

Several other modifications also enhance DEE. Rotamers from multipleresidues can be combined into so-called super-rotamers to prompt furthereliminations (Desmet et al., 1994; Goldstein, 1994). This has theadvantage of eliminating multiple rotamers in a single step. Inaddition, it has been shown that “splitting” the conformational spacebetween rotamers improves the efficiency of DEE (Pierce et al., 2000).Splitting handles the following special case. Consider rotamer i_(r). Ifa rotamer i_(t1) contributes a lower energy than i_(r) for a portion ofthe conformational space, and a rotamer i_(t2) has a lower energy thani_(r) for the remaining fraction, then i_(r) can be eliminated. Thiscase would not be detected by the less sensitive Desmet or Goldsteincriteria. In the preferred implementations of the invention as describedherein, all of the described enhancements to DEE were used.

For further discussion of these methods see, Goldstein, R. F. (1994),Efficient rotamer elimination applied to protein side-chains and relatedspin glasses, Biophysical Journal 66, 1335–1340, Desmet, J., De Maeyer,M., Hazes, B. & Lasters, I. (1992), The dead-end elimination theorem andits use in protein side-chain positioning. Nature 356, 539–542; Desmet,J., De Maeyer, M. & Lasters, I. (1994), In The Protein Folding Problemand Tertiary Structure Prediction (Jr., K. M. & Grand, S. L., eds.), pp.307–337 (Birkhauser, Boston); De Maeyer, M., Desmet, J. & Lasters, 1.(1997), All in one: a highly detailed rotamer library improves bothaccuracy and speed in the modeling of side chains by dead-endelimination, Folding & Design 2, 53–66, Gordon, D. B. & Mayo, S. L.(1998), Radical performance enhancements for combinatorial optimizationalgorithms based on the dead-end elimination theorem, Journal ofComputational Chemistry 19, 1505–1514; Pierce, N. A., Spriet, J. A.,Desmet, J., Mayo, S. L., (2000), Conformational splitting: A morepowerful criterion for dead-end elimination; Journal of ComputationalChemistry 21, 999–1009.

“Expression system” means a host cell and compatible vector undersuitable conditions, e.g. for the expression of a protein coded for byforeign DNA carried by the vector and introduced to the host cell.Common expression systems include E. coli host cells and plasmidvectors, insect host cells such as Sf9, Hi5 or S2 cells and Baculovirusvectors, Drosophila cells (Schneider cells) and expression systems, andmammalian host cells and vectors.

“Host cell” means any cell of any organism that is selected, modified,transformed, grown or used or manipulated in any way for the productionof a substance by the cell. For example, a host cell may be one that ismanipulated to express a particular gene, a DNA or RNA sequence, aprotein or an enzyme. Host cells may be cultured in vitro or one or morecells in a non-human animal (e.g., a transgenic animal or a transientlytransfected animal).

The methods of the invention may include steps of comparing sequences toeach other, including wild-type sequence to one or more mutants. Suchcomparisons typically comprise alignments of polymer sequences, e.g.,using sequence alignment programs and/or algorithms that are well knownin the art (for example, BLAST, FASTA and MEGALIGN, to name a few). Theskilled artisan can readily appreciate that, in such alignments, where amutation contains a residue insertion or deletion, the sequencealignment will introduce a “gap” (typically represented by a dash, “-”,or “Δ”) in the polymer sequence not containing the inserted or deletedresidue.

“Homologous”, in all its grammatical forms and spelling variations,refers to the relationship between two proteins that possess a “commonevolutionary origin”, including proteins from superfamilies in the samespecies of organism, as well as homologous proteins from differentspecies of organism. Such proteins (and their encoding nucleic acids)have sequence homology, as reflected by their sequence similarity,whether in terms of percent identity or by the presence of specificresidues or motifs and conserved positions.

The term “sequence similarity”, in all its grammatical forms, refers tothe degree of identity or correspondence between nucleic acid or aminoacid sequences that may or may not share a common evolutionary origin(see, Reeck et al., supra). However, in common usage and in the instantapplication, the term “homologous”, when modified with an adverb such as“highly”, may refer to sequence similarity and may or may not relate toa common evolutionary origin.

A nucleic acid molecule is “hybridizable” to another nucleic acidmolecule, such as a cDNA, genomic DNA, or RNA, when a single strandedform of the nucleic acid molecule can anneal to the other nucleic acidmolecule under the appropriate conditions of temperature and solutionionic strength (see Sambrook et al., Molecular Cloning: A LaboratoryManual, Second Edition (1989) Cold Spring Harbor Laboratory Press, ColdSpring Harbor, N.Y.). The conditions of temperature and ionic strengthdetermine the “stringency” of the hybridization. For preliminaryscreening for homologous nucleic acids, low stringency hybridizationconditions, corresponding to a T_(m) (melting temperature) of 55° C.,can be used, e.g., 5×SSC, 0.1% SDS, 0.25% milk, and no formamide; or 30%formamide, 5×SSC, 0.5% SDS). Moderate stringency hybridizationconditions correspond to a higher T_(m), e.g., 40% formamide, with 5× or6×SSC. High stringency hybridization conditions correspond to thehighest T_(m), e.g., 50% formamide, 5×or 6×SSC. SSC is a 0.15M NaCl,0.015M Na-citrate. Hybridization requires that the two nucleic acidscontain complementary sequences, although depending on the stringency ofthe hybridization, mismatches between bases are possible. Theappropriate stringency for hybridizing nucleic acids depends on thelength of the nucleic acids and the degree of complementation, variableswell known in the art. The greater the degree of similarity or homologybetween two nucleotide sequences, the greater the value of T_(m) forhybrids of nucleic acids having those sequences. The relative stability(corresponding to higher T_(m)) of nucleic acid hybridizations decreasesin the following order: RNA:RNA, DNA:RNA, DNA:DNA. For hybrids ofgreater than 100 nucleotides in length, equations for calculating T_(m)have been derived (see Sambrook et al., supra, 9.50–9.51). Forhybridization with shorter nucleic acids, i.e., oligonucleotides, theposition of mismatches becomes more important, and the length of theoligonucleotide determines its specificity (see Sambrook et al., supra,11.7–11.8). A minimum length for a hybridizable nucleic acid is at leastabout 10 nucleotides; preferably at least about 15 nucleotides; and morepreferably the length is at least about 20 nucleotides.

Unless specified, the term “standard hybridization conditions” refers toa T_(m) of about 55° C., and utilizes conditions as set forth above. Ina preferred embodiment, the T_(m) is 60° C.; in a more preferredembodiment, the T_(m) is 65° C. In a specific embodiment, “highstringency” refers to hybridization and/or washing conditions at 68° C.in 0.2×SSC, at 42° C. in 50% formamide, 4×SSC, or under conditions thatafford levels of hybridization equivalent to those observed under eitherof these two conditions.

Suitable hybridization conditions for oligonucleotides (e.g., foroligonucleotide probes or primers) are typically somewhat different thanfor full-length nucleic acids (e.g., full-length cDNA), because of theoligonucleotides' lower melting temperature. Because the meltingtemperature of oligonucleotides will depend on the length of theoligonucleotide sequences involved, suitable hybridization temperatureswill vary depending upon the oligoncucleotide molecules used. Exemplarytemperatures may be 37° C. (for 14-base oligonucleotides), 48° C. (for17-base oligoncucleotides), 55° C. (for 20-base oligonucleotides) and60° C. (for 23-base oligonucleotides). Exemplary suitable hybridizationconditions for oligonucleotides include washing in 6×SSC/0.05% sodiumpyrophosphate, or other conditions that afford equivalent levels ofhybridization.

“Polypeptide,” “peptide” or “protein” are used interchangably todescribe a chain of amino acids that are linked together by chemicalbonds called “peptide bonds.” A protein or polypeptide, including anenzyme, may be a “native” or “wild-type”, meaning that it occurs innature; or it may be a “mutant”, “variant” or “modified”, meaning thatit has been made, altered, derived, or is in some way different orchanged from a native protein or from another mutant.

“Rotamer” is defined as a set of possible conformers for each amino acidor analog side chain. See Ponder, et al., Acad. Press Inc. (London) Ltd.pp. 775–791 (1987); Dunbrack, et al., Struc. Biol. 1(5):334–340 (1994);Desmet, et al., Nature 356:539–542 (1992). A “rotamer library” is acollection of a set of possible/allowable rotametic conformations for agiven set of amino acids or analogs. There are two general types ofrotamer libraries: “backbone dependent” and “backbone independent.” Abackbone dependent rotamer library allows different rotamers dependingon the position of the residue in the backbone; thus for example,certain leucine rotamers are allowed if the position is within an αhelix, and different leucine rotamers are allowed if the position is notin an α-helix. A backbone independent rotamer library utilizes allrotamers of an amino acid at every position. In general, a backboneindependent library is preferred in the consideration of core residues,since flexibility in the core is important. However, backboneindependent libraries are computationally more expensive, and thus forsurface and boundary positions, a backbone dependent library ispreferred. However, either type of library can be used at any position.

“Variable residue position” herein is meant an amino acid position ofthe protein to be designed that is not fixed in the design method as aspecific residue or rotamer, generally the wild-type residue or rotamer.It should be noted that even if a position is chosen as a variableposition, it is possible that the methods of the invention will optimizethe sequence in such a way as to select the wild type residue at thevariable position. This generally occurs more frequently for coreresidues, and less regularly for surface residues. In addition, it ispossible to fix residues as non-wild type amino acids as well.

“Fixed residue position” means that the residue identified in the threedimensional structure as being in a set conformation. In someembodiments, a fixed position is left in its original conformation(which may or may not correlate to a specific rotamer of the rotamerlibrary being used). Alternatively, residues may be fixed as a non-wildtype residue depending on design needs; for example, when knownsite-directed mutagenesis techniques have shown that a particularresidue is desirable (for example, to eliminate a proteolytic site oralter the substrate specificity of an AARS), the residue may be fixed asa particular amino acid. Residues which can be fixed include, but arenot limited to, structurally or biologically functional residues. Forexample, the anchor residues.

In certain embodiments, a fixed position may be “floated”; the aminoacid or analog at that position is fixed, but different rotamers of thatamino acid or analog are tested. In this embodiment, the variableresidues may be at least one, or anywhere from 0.1% to 99.9% of thetotal number of residues. Thus, for example, it may be possible tochange only a few (or one) residues, or most of the residues, with allpossibilities in between.

III. Illustrative Embodiments

A. Available Sequence and Structural Information for tRNA Synthetases

An accurate description of the binding pocket is important for thecomputational design approach of the instant invention since it dependson the crystal structure for the protein backbone descriptions, althoughin many cases it is perfectly acceptable to use crystal structure of ahomologous protein (for example, a homolog from a related species) oreven a conserved domain to substitute the crystallographic bindingpocket structure description. The crystal structure also defines theorientation of the natural substrate amino acid in the binding pocket ofa synthetase, as well as the relative position of the amino acidsubstrate to the synthetase residues, especially those residues in andaround the binding pocket. To design the binding pocket for the analogs,it is preferred that these analogs bind to the synthetase in the sameorientation as the natural substrate amino acid, since this orientationmay be important for the adenylation step.

The AARSs may be from any organism, including prokaryotes andeukaryotes, with enzymes from bacteria, fungi, extremeophiles such asthe archebacteria, worm, insects, fish, amphibian, birds, animals(particularly mammals and particularly human) and plants all possible.

As described above, most cells make twenty different aminoacyl-tRNAsynthetases, one for each type of amino acid. The crystal structures ofnearly all of these different enzymes are currently available in theBrookhaven Protein Data Bank (PDB, see Bernstein et al., J. Mol. Biol.112: 535–542, 1977). A list of all the AARSs with solved crystalstructures as of April 2001 is available on the PDB website. Forexample, the crystal structure of Thermus Aquaticus Phenylalanyl tRNASynthetase complexed with Phenylalanine has a resolution of 2.7 Å, andits PDB ID is 1B70.

The structure database or Molecular Modeling DataBase (MMDB) containsexperimental data from crystallographic and NMR structuredeterminations. The data for MMDB are obtained from the Protein DataBank (PDB). The NCBI (National Center for Biotechnology Information) hascross-linked structural data to bibliographic information, to thesequence databases, and to the NCBI taxonomy. Cn3D, the NCBI 3Dstructure viewer, can be used for easy interactive visualization ofmolecular structures from Entrez.

The Entrz 3D Domains database contains protein domains from the NCBIConserved Domain Database (CDD). Computational biologists defineconserved domains based on recurring sequence patterns or motifs. CDDcurrently contains domains derived from two popular collections, Smartand Pfam, plus contributions from colleagues at NCBI, such as COG. Thesource databases also provide descriptions and links to citations. Sinceconserved domains correspond to compact structural units, CDs containlinks to 3D-structure via Cn3D whenever possible.

To identify conserved domains in a protein sequence, the CD-Searchservice employs the reverse position-specific BLAST algorithm. The querysequence is compared to a position-specific score matrix prepared fromthe underlying conserved domain alignment. Hits may be displayed as apairwise alignment of the query sequence with a representative domainsequence, or as a multiple alignment. CD-Search now is run by default inparallel with protein BLAST searches. While the user waits for the BLASTqueue to further process the request, the domain architecture of thequery may already be studied. In addition, CDART, the Conserved DomainArchitecture Retrieval Tool allows user to search for proteins withsimilar domain architectures. CDART uses precomputed CD-search resultsto quickly identify proteins with a set of domains similar to that ofthe query. For more details, see Marchler-Bauer et al., Nucleic AcidsResearch 31: 383–387, 2003; and Marchler-Bauer et al., Nucleic AcidsResearch 30: 281–283, 2002.

In addition, a database of known aminoacyl tRNA synthetases has beenpublished by Maciej Szymanski, Marzanna A. Deniziak and JanBarciszewski, in Nucleic Acids Res. 29:288–290, 2001 (titled“Aminoacyl-tRNA synthetases database”). A corresponding website(http://rose.man.poznan.pl/aars/seq_main.html) provides details aboutall known AARSs from different species. For example, according to thedatabase, the Isoleucyl-tRNA Synthetase for the radioresistant bacteriaDeinococcus radiodurans (Accession No. AAF10907) has 1078 amino acids,and was published by White et al. in Science 286:1571–1577(1999); theValyl-tRNA Synthetase for mouse (Mus musculus) has 1263 amino acids(Accession No. AAD26531), and was published by Snoek M. and van Vugt H.in Immunogenetics 49: 468–470(1999); and the Phenylalanyl-tRNASynthetase sequences for human, Drosophila, S. pombe, S. cerevisiae,Candida albicans, E. coli, and mumerous other bacteria including Thermusaquaticus ssp. thermophilus are also available. The database was lastupdated in May 2000. Similar information for other newly identifiedAARSs can be obtained, for example, by conducting a BLAST search usingany of the known sequences in the AARS database as query against theavailable public (such as the non-redundant database at NCBI, or “nr”)or proprietory private databases.

Alternatively, in certain embodiments, if the exact crystal structure ofa particular AARS is not known, but its protein sequence is similar orhomologous to a known AARS sequence with a known crystal structure. Insuch instances, it is expected that the conformation of the AARS inquestion will be similar to the known crystal structure of thehomologous AARS. The known structure may, therefore, be used as thestructure for the AARS of interest, or more preferably, may be used topredict the structure of the AARS of interest (i.e., in “homologymodeling” or “molecular modeling”). As a particular example, theMolecular Modeling Database (MMDB) described above (see, Wang et al.,Nucl. Acids Res. 2000, 28:243–245; Marchler-Bauer et al., Nucl. AcidsRes. 1999, 27:240–243) provides search engines that may be used toidentify proteins and/or nucleic acids that are similar or homologous toa protein sequence (referred to as “neighboring” sequences in the B),including neighboring sequences whose three-dimensional structures areknown. The database further provides links to the known structures alongwith alignment and visualization tools, such as Cn3D (developed byNCBI), RasMol, etc., whereby the homologous and parent sequences may becompared and a structure may be obtained for the parent sequence basedon such sequence alignments and known structures.

The homologous AARS sequence with known 3D-structure is preferably atleast about 60%, or at least about 70%, or at least about 80%, or atleast about 90%, or at least about 95% identical to the AARS of interestin the active site region or the pocket region for amino acid substratebinding. Such active site or pocket site may not be continuous in theprimary amino acid sequence of the AARS since distant amino acids maycome together in the 3D-structure. In this case, sequence homology oridentity can be calculated using, for example, the NCBI standard BLASTpprograms for protein using default conditions, in regions alignedtogether (without insertions or deletions in either of the two sequencesbeing compared) and including residues known to be involved in substrateamino acid binding. For example, the Thermus Aquaticus Phenylalanyl tRNASynthetase alpha subunit appears to have an “insert” region fromresidues 156 to 165 when compared to its homologs from other species.This region can be disregarded in calculating sequence identity.Alternatively, the homologous AARS is preferably about 35%, or 40%, or45%, or 50%, or 55% identical overall to the AARS of interest. The E.coli Phenylalanyl tRNA Synthetase alpha subunit is about 45% identicaloverall, and about 80% identical in the active site region to theThermus Aquaticus Phenylalanyl tRNA Synthetase. The human PhenylalanyltRNA Synthetase alpha subunits is about 62%, 60%, 54%, 50% identicaloverall to its Drosophila, worm (C. elegans), plant (Arabidopsisthaliana), yeast (S. cerevisiae) counterparts, respectively.

In the few cases where the structure for a particular AARS sequence maynot be known or available, it is typically possible to determine thestructure using routine experimental techniques (for example, X-raycrystallography and Nuclear Magnetic Resonance (NMR) spectroscopy) andwithout undue experimentation. See, e.g., NMR of Macromolecules: APractical Approach, G. C. K. Roberts, Ed., Oxford University Press Inc.,New York (1993); Ishima R, Torchia D A, “Protein Dynamics from NMR,” NatStruct Biol, 7: 740–743 (2000); Gardner K H, Kay L E, “The use of H-2,C-13, N-15 multidimensional NMR to study the structure and dynamics ofproteins,” Annu. Rev. Bioph. Biom., 27: 357–406 (1998); Kay L E, “NMRmethods for the study of protein structure and dynamics,” Biochern CellBiol, 75: 1–15 (1997); Dayie K T, Wagner G, Lefevre J F, “Theory andpractice of nuclear spin relaxation in proteins,” Annu Rev Phys Chem,47: 243–282 (1996); Wuthrich K, “NMR—This and other methods for proteinand nucleic-acid structure determination,” Acta Cyrstallogr. D, 51:249–270 (1995); Kahn R, Carpentier P, Berthet-Colominas C, et al.,“Feasibility and review of anomalous X-ray diffraction at longwavelengths in materials research and protein crystallography,” J.Synchrotron Radiat., 7: 131–138 (2000); Oakley A J, Wilce M C J,“Macromolecular cyrstallography as a tool for investigating drug, enzymeand receptor interactions,” Clin. Exp. Pharmacol. P.,27:145–151 (2000);Fourme R, Shepard W, Schiltz M, et al., “Better structures from betterdata through better methods: a review of developments in de novomacromolecular phasing techniques and associated instrumentation atLURE,” J. Synchrotron Radiat., 6: 834–844 (1999).

Alternatively, and in less preferable embodiments, the three-dimensionalstructure of a AARS sequence may be calculated from the sequence itselfand using ah initio molecular modeling techniques already known in theart. See e.g., Smith T F, LoConte L, Bienkowska J, et al., “Currentlimitations to protein threading approaches,” J. Comput. Biol., 4:217–225 (1997); Eisenhaber F, Frommel C, Argos P, “Prediction ofsecondary structural content of proteins from their amino acidcomposition alone 2. The paradox with secondary structural class,”Proteins, 24: 169–179 (1996); Bohm G, “New approaches in molecularstructure prediction,” Biophys Chem., 59: 1–32 (1996); Fetrow J S,Bryant S H, “New programs for protein tertiary structure prediction,”BioTechnol., 11: 479–484, (1993); Swindells M B, Thorton J M, “Structureprediction and modeling,” Curr. Opin. Biotech., 2: 512–519 (1991);Levitt M, Gerstein M, Huang E, et al., “Protein folding: The endgame,”Annu. Rev. Biochem., 66: 549–579 (1997). Eisenhaber F., Persson B.,Argos P., “Protein-structure prediction—recognition of primary,secondary, and tertiary structural features from amino-acid-sequence,”Crit Rev Biochem Mol, 30:1–94(1995); Xia Y, Huang E S, Levitt M, et al.,“Ab initio construction of protein tertiary structures using ahierarchical approach,” J. Mol. Biol., 300:171–185 (2000); Jones D T,“Protein structure prediction in the post genomicera,” Curr Opin StrucBiol, 10: 371–379 (2000). Three-dimensional structures obtained from abinitio modeling are typically less reliable than structures obtainedusing empirical (e.g., NMR spectroscopy or X-ray crystallography) orsemi-empirical (e.g., homology modeling) techniques. However, suchstructures will generally be of sufficient quality, although lesspreferred, for use in the methods of this invention.

For additional details, see section G below.

B. Binding Site Design

The protein design algorithm ORBIT, described in Dahiyat and Mayo(Protein Sci 5(5): 895–903, 1996; and Science 278: 82–87, 1997, entirecontents incorporated herein by reference) and Dahiyat et al. (J MolBiol 273(4): 789–96, 1997, entire content incorporated herein byreference) can be used to predict the optimal amino acid sequences ofthe binding pocket for binding to the different analogs. Although othersimilar or equivalent algorithms may also be used for the same purposewith minor modification. Selection of amino acids is performed using avery efficient DEE (Dead-End Elimination)-related search algorithm (seebelow) that relies on a discrete set of allowed conformations(“rotamers”) for each side chain and empirical potential energyfunctions that are used to calculate pair-wise interactions between sidechains and the between the side chains and backbone.

Surveys of protein structure database have shown that side chainsexhibit marked conformational preferences, and that most side chains arelimited to a small number of torsional angles. Thus, the torsionalflexibility of most amino acids can be represented with a discrete setof allowed conformations called rotamers. Backbone-dependent rotamericpreferences in side chains are observed in crystal structures, based onthe dependency of the rotamers on the main-chain conformations. ORBITaccounts for the torsional flexibilities of side chains by providingrotamer libraries that are based on those developed by Dunbrack andKarplus (Dunbrack and Karplus, J Mol Biol 230(2): 543–74, 1993; Dunbrackand Karplus, Nat Struct Biol 1(5): 334–40 1994, entire contents of whichare all incoporated herein by reference).

In our design, we would like to optimize the binding pocket of an AARSfor binding to the intended analogs. In performing the optimizationcalculations, we would like to vary the torsional angles of the intendedanalogs and side chains lining the pocket simultaneously. This requiresgenerating rotamer libraries for the analogs, since they are notincluded in the standard rotamer libraries. Backbone independent rotamerlibraries for all the analogs are generated as described below.

Since the residues in the pocket are buried in the protein structure, weused force field parameters similar to those used in protein core designcalculations. The design algorithm uses energy terms based on a forcefield that includes van der Waals interactions, electrostaticinteractions, hydrogen bonding, and salvation effects (see Gordon etal., Curr Opin Struct Biol 9(4): 509–13, 1999, entire contentincorporated herein by reference).

Based on the crystallographic data, residue positions in the AARS thatare involved in critical interactions between the AARS and the naturalamino acid backbone (the so-called “anchor residues”) are identified.These residue positions are fixed in identity and rotameric conformationin the ensuing design calculation. Meanwhile, residue positions in theAARS, especially those in the binding pocket that interact with the sidechain of the natural substrate amino acid (or its analog) areidentified. These residue positions are potential target positions forredesign.

Design calculations (see below) are run by fixing the identity of aparticular analog, and varying the target positions on the AARSdescribed above. Certain target positions may be allowed to be any ofthe 20 natural amino acids, with the possible exception of proline,methionine or cysteine. These amino acids may nevertheless by be allowedat those positions if the wild-type identity of these positions are Met,Pro, or Cys. At certain other target positions, only amino acids with acertain characteristic (such as small, large, hydorphobic, hydrophilic,aromatic, etc.) are allowed based on the need of the design. It isexpected that many of these target positions are buried in the core anda number of them may pack against the natural substrate in the crystalstructure.

Information from othert independent studies, such as mutation analysisat a specific position which has been shown to have altered substratespecificity may also help in the design process.

As described above, the anchor residues are held fixed both in identityand conformation in all the calculations. These residues make veryimportant electrostatic interactions with the substrate and thisinteraction is probably equally critical for the analogs. Theaminoadenylation step is required for the incorporation of all the aminoacids and hence, it may be important to make sure that the carbonyl andthe amide groups of the analog zwitterions are also anchored the sameway as the natural substrate at this site.

As an initial attempt, all the analog rotamers generated in the rotamerlibrary are allowed in the calculation. As a consequence, certainrotamers selected in the structure (AARS-analog) generated may not beburied in the binding pocket, and some part or most of the analog may besolvent-exposed. A second (or more subsequent) calculation(s) can thenbe run allowing only those analog rotamers that would pack into thebinding pocket. These are the rotamers with all possible combinations ofχ1 and χ2 of the natural substrate amino acid with a maximal of ±20 oftorsional angle variations, in increments of, say 1°, 2°, 3°, or 5°,etc. The structure generated in this calculation preferably will havethe suitable part of the analog buried in the pocket, and is preferablycompletely, or almost completely superimposable with the naturalsubstrate amino acid in the AARS crystal structure.

C. Design Calculations

The present invention utilizes an “inverse protein folding” approachdirected to the quantitative design and optimization of amino acidsequences, especially the AARS with its intended analog bound in thebinding pocket for its natural substrate amino acid. Similar to proteindesign, such approach seeks to find a sequence or set of sequences thatwill fold into a desired structure. These approaches can be contrastedwith a “protein folding” approach which attempts to predict a structuretaken by a given sequence. In a generallized approach, target AARSresidue positions that is selected for redesign are determined based onthe criteria described above. The side chains of any positions to bevaried are then removed. The resulting structure consisting of theprotein backbone (including the fixed backbone structure of the analog)and the remaining sidechains is called the template. Each variableresidue position can then be optionally classified as a core residue, asurface residue, or a boundary residue. In that case, eachclassification defines a subset of possible amino acid residues for theposition (for example, core residues generally will be selected from theset of hydrophobic residues, surface residues generally will be selectedfrom the hydrophilic residues, and boundary residues may be either).Each amino acid residue or the analog can be represented by a discreteset of all allowed conformers of each side chain, called rotamers. Thus,to arrive at an optimal sequence for a backbone, all possible sequencesof rotamers must be screened, where each backbone position can beoccupied either by each amino acid in all its possible rotameric states,or a subset of amino acids, and thus a subset of rotamers. For thispurpose, the analog backbone is treated as part of the AARS template.

Two sets of interactions are then calculated for each rotamer at everyposition: the interaction of the rotamer side chain with all or part ofthe backbone (the “singles” energy, also called the rotamer/template orrotamer/backbone energy), and the interaction of the rotamer side chainwith all other possible rotamers at every other position or a subset ofthe other positions (the “doubles” energy, also called therotamer/rotamer energy). The energy of each of these interactions iscalculated through the use of a variety of scoring functions, whichinclude the energy of van der Waal's forces, the energy of hydrogenbonding, the energy of secondary structure propensity, the energy ofsurface area salvation and the electrostatics (see Gordon et al.,supra). Thus, the total energy of each rotamer interaction, both withthe backbone and other rotamers, is calculated, and stored in a matrixform.

The discrete nature of rotamer sets allows a simple calculation of thenumber of rotamer sequences to be tested. A backbone of length n with mpossible rotamers per position will have m^(n) possible rotamersequences, a number which grows exponentially with sequence length andrenders the calculations either unwieldy or impossible in real time.Accordingly, to solve this combinatorial search problem, a “Dead EndElimination” (DEE) calculation is performed. The DEE calculation isbased on the fact that if the worst total interaction of a first rotameris still better than the best total interaction of a second rotamer,then the second rotamer cannot be part of the global optimum solution.Since the energies of all rotamers have already been calculated, the DEEapproach only requires sums over the sequence length to test andeliminate rotamers, which speeds up the calculations considerably. DEEcan be rerun comparing pairs of rotamers, or combinations of rotamers,which will eventually result in the determination of a single sequencewhich represents the global optimum energy.

Once the global solution has been found, a search (such as Monte Carlosearch) may be done to generate a rank-ordered list of sequences in theneighborhood of the DEE solution. Starting at the DEE solution, randompositions are changed to other rotamers, and the new sequence energy iscalculated. If the new sequence meets the criteria for acceptance, it isused as a starting point for another jump. After a predetermined numberof jumps, a rank-ordered list of sequences is generated. Typically, 10⁶jumps (steps) are used in a Monte Carlo search.

The results may then be experimentally verified by physically generatingone or more of the protein sequences followed by experimental testing.The information obtained from the testing can then be fed back into theanalysis, to modify the procedure if necessary.

D. Rotamers for Target Posision Residues and Generation ofBackbone-independent Rotamer Libraries for All the Analogs

Once the protein backbone structure (including the analog backbone) hasbeen selected and input, and the analog identity and variable residuepositions chosen, a group of potential rotamers for each of the variableresidue positions is established.

As is known in the art, each amino acid side chain has a set of possibleconformers, called rotamers. See Ponder, et al., Acad. Press Inc.(London) Ltd. pp. 775–791 (1987); Dunbrack, et al., Struc. Biol. 1(5):334–340 (1994); Desmet, et al., Nature 356: 539–542 (1992) all of whichare hereby expressly incorporated by reference in their entireity. Thus,a set of discrete rotamers for every amino acid side chain is used. Asdescribed above, there are two general types of rotamer libraries:backbone dependent and backbone independent. Either type of library canbe used at any position.

In addition, a preferred embodiment does a type of “fine tuning” of therotamer library by expanding the possible χ angle values of the rotamersby plus and minus one standard deviation (±1 SD) (or more) about themean value, in order to minimize possible errors that might arise fromthe discreteness of the library. This is particularly important foraromatic residues, and fairly important for hydrophobic residues, due tothe increased requirements for flexibility in the core and the rigidityof aromatic rings; it is not as important for the other residues. Thus apreferred embodiment expands the χ1 and χ2 angles for all amino acidsexcept Met, Arg and Lys. For the intended amino acid analogs, the χ1 andχ2 angles are expanded as such in their corresponding rotamers.

To roughly illustrate the numbers of rotamers, in one version of theDunbrack & Karplus backbone-dependent rotamer library, Ala has 1rotamer, Gly has 1 rotamer, Arg has 55 rotamers, The has 9 rotamers, Lyshas 57 rotamers, Glu has 69 rotamers, Asn has 54 rotamers, Asp has 27rotamers, Trp has 54 rotamers, Tyr has 36 rotamers, Cys has 9 rotamers,Gln has 69 rotamers, His has 54 rotamers, Val has 9 rotamers, Ile has 45rotamers, Leu has 36 rotamers, Mat has 21 rotamers, Ser has 9 rotamers,and Phe has 36 rotamers.

In general, proline is not generally used in a target position, since itwill rarely be chosen for any position, although it can be included ifdesired. Similarly, a preferred embodiment omits cysteine as aconsideration, only to avoid potential disulfide problems, although itcan be included if desired.

As will be appreciated by those in the art, other rotamer libraries withall dihedral angles staggered can be used or generated.

In a preferred embodiment, at a minimum, at least one variable positionhas rotamers from at least two different amino acid side chains; thatis, a sequence is being optimized, rather than a structure.

In a preferred embodiment, rotamers from all of the amino acids (or allof them except cysteine, glycine and proline) are used for each variableresidue position; that is, the group or set of potential rotamers ateach variable position is every possible rotamer of each amino acid.This is especially preferred when the number of variable positions isnot high as this type of analysis can be computationally expensive.

There is a special consideration for rotamers of amino acid analogs. Forall the natural amino acids, the possible χ1 and χ2 angles are derivedfrom database analysis. Since this is not feasible in the case ofartificial amino acids or analogs, the closest approximation for χ1 andχ2 angles for the analogs are taken to be the same as those for thenatural substrate amino acid. Moreover, we would like the analogs tohave a conformation as close as possible, yet with certain flexibility,to the orientation of the natural substrate amino acid in the bindingpocket. So allowing similar torsional angles for the analogs as ofnatural substrate seems logical.

Preferably, as described above, both the χ1 and χ2 torsional angles forthe analogs are varied to match those of the natural substrate rotamersin the standard backbone independent rotamer library. The torsionalangles of the natural substrate in the crystal structure are alsoincluded in the new rotamer libraries for both the natural substrate andthe analogs. Charges were assigned only to the heavy atoms of theanalogs to be consistent with the way charges for the natural aminoacids are represented in ORBIT.

E. Determining Conformational Energy

In certain embodiments, each variable position is classified as either acore, surface or boundary residue position, although in some cases, asexplained below, the variable position may be set to glycine to minimizebackbone strain.

The classification of residue positions as core, surface or boundary maybe done in several ways, as will be appreciated by those in the art. Ina preferred embodiment, the classification is done via a visual scan ofthe original protein backbone (including the bound analog) structure,including the side chains, and assigning a classification based on asubjective evaluation of one skilled in the art of protein modelling.Alternatively, a preferred embodiment utilizes an assessment of theorientation of the Cα–Cβ vectors relative to a solvent accessiblesurface computed using only the template Cα atoms. In a preferredembodiment, the solvent accessible surface for only the Cα atoms of thetarget fold is generated using the Connolly algorithm with a proberadius ranging from about 4 to about 12 Å, with from about 6 to about 10Å being preferred, and 8 Å being particularly preferred. The Cα radiusused ranges from about 1.6 Å to about 2.3 Å, with from about 1.8 toabout 2.1 Å being preferred, and 1.95 Å being especially preferred. Aresidue is classified as a core position if a) the distance for its Cα,along its Cα–Cβ vector, to the solvent accessible surface is greaterthan about 4–6 Å, with greater than about 5.0 Å being especiallypreferred, and b) the distance for its Cβ to the nearest surface pointis greater than about 1.5–3 Å, with greater than about 2.0 Å beingespecially preferred. The remaining residues are classified as surfacepositions if the sum of the distances from their Cα, along their Cα-Cβvector, to the solvent accessible surface, plus the distance from theirCβ to the closest surface point was less than about 2.5–4 Å, with lessthan about 2.7 Å being especially preferred. All remaining residues areclassified as boundary positions. For example, residues in the bindingpocket are buried in the protein structure, force field parameterssimilar to those used in protein core design calculations can be usedwhen calculating these residues.

Once each variable position is classified as either core, surface orboundary, a set of amino acid side chains, and thus a set of rotamers,is assigned to each position. That is, the set of possible amino acidside chains that the program will allow to be considered at anyparticular position is chosen. Subsequently, once the possible aminoacid side chains are chosen, the set of rotamers that will be evaluatedat a particular position can be determined. Thus, a core residue willgenerally be selected from the group of hydrophobic residues consistingof alanine, valine, isoleucine, leucine, phenylalanine, tyrosine,tryptophan, and methionine (in some embodiments, when the α scalingfactor of the van der Waals scoring function, described below, is low,methionine is removed from the set), and the rotamer set for each coreposition potentially includes rotamers for these eight amino acid sidechains (all the rotamers if a backbone independent library is used, andsubsets if a rotamer dependent backbone is used). Similarly, surfacepositions are generally selected from the group of hydrophilic residuesconsisting of alanine, serine, threonine, aspartic acid, asparagine,glutamine, glutamic acid, arginine, lysine and histidine. The rotamerset for each surface position thus includes rotamers for these tenresidues. Finally, boundary positions are generally chosen from alanine,serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid,arginine, lysine histidine, valine, isoleucine, leucine, phenylalanine,tyrosine, tryptophan, and methionine. The rotamer set for each boundaryposition thus potentially includes every rotamer for these seventeenresidues (assuming cysteine, glycine and proline are not used, althoughthey can be).

Thus, as will be appreciated by those in the art, there is acomputational benefit to classifying the residue positions, as itdecreases the number of calculations. It should also be noted that theremay be situations where the sets of core, boundary and surface residuesare altered from those described above; for example, under somecircumstances, one or more amino acids is either added or subtractedfrom the set of allowed amino acids. For example, some proteins whichdimerize or multimerize, or have ligand binding sites, may containhydrophobic surface residues, etc. In addition, residues that do notallow helix “capping” or the favorable interaction with an α-helixdipole may be subtracted from a set of allowed residues. Thismodification of amino acid groups is done on a residue by residue basis.

In a preferred embodiment, proline, cysteine and glycine are notincluded in the list of possible amino acid side chains, and thus therotamers for these side chains are not used. However, in a preferredembodiment, when the variable residue position has a Φ angle (that is,the dihedral angle defined by 1) the carbonyl carbon of the precedingamino acid; 2) the nitrogen atom of the current residue; 3) the α-carbonof the current residue; and 4) the carbonyl carbon of the currentresidue) greater than 0°, the position is set to glycine to minimizebackbone strain.

Once a three-dimensional structure has been obtained or otherwiseprovided for the AARS sequence as described above, a fitness value forthe protein may be obtained by calculating or determining the“conformational energy” or “energy” E of the protein structure. Inparticular and without being limited to any particular theory ormechanism of action, sequences that have a lower (i.e., more negative)conformational energy are typically expected to be more stable andtherefore more “fit” than are sequences having higher (i.e., lessnegative) conformation energy. Thus, the fitness of a sequence ispreferably related to its negative conformational energy; i.e., F=−E.

Typically, the conformational energy is calculated ab initio from theconformation determination discussed above, and using an empirical orsemi-empirical force field such as CHARM (Brooks et al., J. Comp. Chem.1983, 4: 187–217; MacKerell et al., in The Encyclopedia of ComputationalChemistry, Vol. 1:271–277, John Wiley & Sons, Chichester, 1998) AMBER(see, Cornell et al., J. Amer Chem. Soc. 1995, 117:5179; Woods et al.,J. Phys. Chem. 1995, 99:3832–3846; Weiner et al., J. Comp. Chem. 1986,7:230; and Weiner et al., J. Amer. Chem. Soc. 1984, 106:765) andDREIDING (Mayo et al., J. Phys. Chem. 1990, 94:8897) to name a few.These and other such force-fields comprise a number of potential scoringfunctions and parameters for at least approximate contributions ofvarious interactions within a macromolecule.

The scoring functions include a van der Waals potential scoringfunction, a hydrogen bond potential scoring function, an atomicsolvation scoring function, a secondary structure propensity scoringfunction and an electrostatic scoring function. As is further describedbelow, at least one scoring function is used to score each position,although the scoring functions may differ depending on the positionclassification or other considerations, like favorable interaction withan α-helix dipole. As outlined below, the total energy which is used inthe calculations is the sum of the energy of each scoring function usedat a particular position, as is generally shown in Equation 1:E _(total) =nE _(vdw) +nE _(as) +nE _(h-bonding) +nE _(ss) +nE_(elec)  (Equation 1)

In Equation 1, the total energy is the sum of the energy of the van derWaals potential (E_(vdw)), the energy of atomic solvation (E_(as)), theenergy of hydrogen bonding (E_(h-bonding)), the energy of secondarystructure (E_(ss)) and the energy of electrostatic interaction(E_(elec)). The term n is either 0 or 1, depending on whether the termis to be considered for the particular residue position, as is morefully outlined below.

In a preferred embodiment, a van der Waals' scoring function is used. Asis known in the art, van der Waals' forces are the weak, non-covalentand non-ionic forces between atoms and molecules, that is, the induceddipole and electron repulsion (Pauli principle) forces. The van derWaals scoring function is based on a van der Waals potential energy.There are a number of van der Waals potential energy calculations,including a Lennard-Jones 12–6 potential with radii and well depthparameters from the Dreiding force field, Mayo et al., J. Prot. Chem.,1990, expressly incorporated herein by reference, or the exponential 6potential. Equation 2, shown below, is the preferred Lennard-Jonespotential:E _(vdw) =D ₀{(R ₀ /R)¹²−2(R ₀ /R)⁶}  (Equation 2)

R₀ is the geometric mean of the van der Waals radii of the two atomsunder consideration, and D₀ is the geometric mean of the well depth ofthe two atoms under consideration. E_(vdw) and R are the energy andinteratomic distance between the two atoms under consideration, as ismore fully described below.

In a preferred embodiment, the van der Waals forces are scaled using ascaling factor, α. Equation 3 shows the use of α in the van der WaalsLennard-Jones potential equation:E _(vdw) =D ₀{(αR ₀ /R)¹²−2(αR ₀ /R)⁶}  (Equation 3)The role of the α scaling factor is to change the importance of packingeffects in the optimization and design of any particular protein. Asdiscussed in the Examples, different values for α result in differentsequences being generated by the present methods. Specifically, areduced van der Waals steric constraint can compensate for therestrictive effect of a fixed backbone and discrete side-chain rotamersin the simulation and can allow a broader sampling of sequencescompatible with a desired fold. In a preferred embodiment, α valuesranging from about 0.70 to about 1.10 can be used, with α values fromabout 0.8 to about 1.05 being preferred, and from about 0.85 to about1.0 being especially preferred. Specific α values which are preferredare 0.80, 0.85, 0.90, 0.95, 1.00, and 1.05.

Generally speaking, variation of the van der Waals scale factor αresults in four regimes of packing specificity: regime 1 where0.9<=α<=1.05 and packing constraints dominate the sequence selection;regime 2 where 0.8<=α<0.9 and the hydrophobic solvation potential beginsto compete with packing forces; regime 3 where α<0.8 and hydrophobicsolvation dominates the design; and, regime 4 where α>1.05 and van derWaals repulsions appear to be too severe to allow meaningful sequenceselection. In particular, different α values may be used for core,surface and boundary positions, with regimes 1 and 2 being preferred forcore residues, regime 1 being preferred for surface residues, and regime1 and 2 being preferred for boundary residues.

In a preferred embodiment, the van der Waals scaling factor is used inthe total energy calculations for each variable residue position,including core, surface and boundary positions.

In a preferred embodiment, an atomic salvation potential scoringfunction is used. As is appreciated by those in the art, solventinteractions of a protein are a significant factor in protein stability,and residue/protein hydrophobicity has been shown to be the majordriving force in protein folding. Thus, there is an entropic cost tosolvating hydrophobic surfaces, in addition to the potential formisfolding or aggregation. Accordingly, the burial of hydrophobicsurfaces within a protein structure is beneficial to both folding andstability. Similarly, there can be a disadvantage for buryinghydrophilic residues. The accessible surface area of a protein atom isgenerally defined as the area of the surface over which a water moleculecan be placed while making van der Waals contact with this atom and notpenetrating any other protein atom. Thus, in a preferred embodiment, thesolvation potential is generally scored by taking the total possibleexposed surface area of the moiety or two independent moieties (either arotamer or the first rotamer and the second rotamer), which is thereference, and subtracting out the “buried” area, i.e. the area which isnot solvent exposed due to interactions either with the backbone or withother rotamers. This thus gives the exposed surface area.

Alternatively, a preferred embodiment calculates the scoring function onthe basis of the “buried” portion; i.e. the total possible exposedsurface area is calculated, and then the calculated surface area afterthe interaction of the moieties is subtracted, leaving the buriedsurface area. A particularly preferred method does both of thesecalculations.

As is more fully described below, both of these methods can be done in avariety of ways. See Eisenberg et al., Nature 319:199–203 (1986);Connolly, Science 221:709–713 (1983); and Wodak, et al., Proc. Natl.Acad. Sci. USA 77(4):1736–1740 (1980), all of which are expresslyincorporated herein by reference. As will be appreciated by those in theart, this solvation potential scoring function is conformationdependent, rather than conformation independent.

In a preferred embodiment, the pairwise salvation potential isimplemented in two components, “singles” (rotamer/template) and“doubles” (rotamer/rotamer), as is more fully described below. For therotamer/template buried area, the reference state is defined as therotamer in question at residue position i with the backbone atoms onlyof residues i−1, i and i+1, although in some instances just i may beused. Thus, in a preferred embodiment, the salvation potential is notcalculated for the interaction of each backbone atom with a particularrotamer, although more may be done as required. The area of the sidechain is calculated with the backbone atoms excluding solvent but notcounted in the area. The folded state is defined as the area of therotamer in question at residue i, but now in the context of the entiretemplate structure including non-optimized side chains, i.e. every otherfoxed position residue. The rotamer/template buried area is thedifference between the reference and the folded states. Therotamer/rotamer reference area can be done in two ways; one by usingsimply the sum of the areas of the isolated rotamers; the secondincludes the full backbone. The folded state is the area of the tworotamers placed in their relative positions on the protein scaffold butwith no template atoms present. In a preferred embodiment, the Richardsdefinition of solvent accessible surface area (Lee and Richards, J. Mol.Biol. 55:379–400, 1971, hereby incorporated by reference) is used, witha probe radius ranging from 0.8 to 1.6 Å, with 1.4 Å being preferred,and Drieding van der Waals radii, scaled from 0.8 to 1.0. Carbon andsulfur, and all attached hydrogens, are considered nonpolar. Nitrogenand oxygen, and all attached hydrogens, are considered polar. Surfaceareas are calculated with the Connolly algorithm using a dot density of10 Å-2 (Connolly, (1983) (supra), hereby incorporated by reference).

In a preferred embodiment, there is a correction for a possibleoverestimation of buried surface area which may exist in the calculationof the energy of interaction between two rotamers (but not theinteraction of a rotamer with the backbone). Since, as is generallyoutlined below, rotamers are only considered in pairs, that is, a firstrotamer is only compared to a second rotamer during the “doubles”calculations, this may overestimate the amount of buried surface area inlocations where more than two rotamers interact, that is, where rotamersfrom three or more residue positions come together. Thus, a correctionor scaling factor is used as outlined below. The general energy ofsolvation is shown in Equation 4:E _(sa) =f(SA)  (Equation 4)

-   -   where E_(sa) is the energy of solvation, f is a constant used to        correlate surface area and energy, and SA is the surface area.        This equation can be broken down, depending on which parameter        is being evaluated. Thus, when the hydrophobic buried surface        area is used, Equation 5 is appropriate:        E _(sa) =f ₁(SA _(buried hydrophobic))  (Equation 5)

where f₁ is a constant which ranges from about 10 to about 50cal/mol/Å², with 23 or 26 cal/mol/Å² being preferred. When a penalty forhydrophilic burial is being considered, the equation is shown inEquation 6:E _(sa) =f ₁(SA _(buried hydrophobic))+f ₂(SA_(buried hydrophilic))  (Equation 6)

-   -   where f₂ is a constant which ranges from −50 to −250 cal/mol/Å²,        with −86 or −100 cal/mo/Å² being preferred. Similarly, if a        penalty for hydrophobic exposure is used, equation 7 or 8 may be        used:        E _(sa) =f ₁(SA _(buried hydrophobic))+f ₃(SA        _(exposed hydrophobic))  (Equation 7)

$\begin{matrix}{E_{sa} = {{f_{1}\left( {SA}_{{buried}\mspace{14mu}{hydrophobic}} \right)} + {f_{2}\left( {SA}_{{buried}\mspace{14mu}{hydrophilic}} \right)} + {f_{3}\left( {SA}_{{exposed}\mspace{14mu}{hydrophobic}} \right)} + {f_{4}\left( {SA}_{{exposed}\mspace{14mu}{hydrophillc}} \right)}}} & \text{(Equation~~8)}\end{matrix}$

In a preferred embodiment, f₃=−f₁.

In one embodiment, backbone atoms are not included in the calculation ofsurface areas, and values of 23 cal/mol/Å² (f₁) and −86 cal/mol/Å² (f₂)are determined.

In a preferred embodiment, this overcounting problem is addressed usinga scaling factor that compensates for only the portion of the expressionfor pairwise area that is subject to overcounting. In this embodiment,values of −26 cal/mol/Å² (f₁) and 100 cal/mol/Å² (f₂) are determined.

Atomic solvation energy is expensive, in terms of computational time andresources. Accordingly, in a preferred embodiment, the solvation energyis calculated for core and/or boundary residues, but not surfaceresidues, with both a calculation for core and boundary residues beingpreferred, although any combination of the three is possible.

In a preferred embodiment, a hydrogen bond potential scoring function isused. A hydrogen bond potential is used as predicted hydrogen bonds docontribute to designed protein stability (see Stickle et al., J. Mol.Biol. 226:1143 (1992); Huyghues-Despointes et al., Biochem. 34:13267(1995), both of which are expressly incorporated herein by reference).As outlined previously, explicit hydrogens are generated on the proteinbackbone structure.

In a preferred embodiment, the hydrogen bond potential consists of adistance-dependent term and an angle-dependent term, as shown inEquation 9:E _(H-Bonding) =D ₀{5(R ₀ /R)¹²−6(R ₀ R)¹⁰ }F(θ, φ, ψ)  (Equation 9)

where R₀ (2.8 Å) and D₀ (8 kcal/mol) are the hydrogen-bond equilibriumdistance and well-depth, respectively, and R is the donor to acceptordistance. This hydrogen bond potential is based on the potential used inDREIDING with more restrictive angle-dependent terms to limit theoccurrence of unfavorable hydrogen bond geometries. The angle termvaries depending on the hybridization state of the donor and acceptor,as shown in Equations 10, 11, 12 and 13. Equation 10 is used for sp³donor to sp³ acceptor; Equation 11 is used for sp³ donor to sp²acceptor, Equation 12 is used for sp² donor to sp³ acceptor, andEquation 13 is used for sp² donor to sp² acceptor:F=cos² θ cos²(φ−109.5)  (Equation 10)F=cos² θ cos² φ  (Equation 11)F=cos⁴ θ  (Equation 12)F=cos² θ cos²(max[Φ, φ])  (Equation 13)

In Equations 10–13, θ is the donor-hydrogen-acceptor angle, Φ is thehydrogen-acceptor-base angle (the base is the atom attached to theacceptor, for example the carbonyl carbon is the base for a carbonyloxygen acceptor), and Φ is the angle between the normals of the planesdefined by the six atoms attached to the sp² centers (the supplement ofφ is used when φ is less than 90°). The hydrogen-bond function is onlyevaluated when 2.6 Å<=R<=3.2 Å, θ>90°, Φ−109.5°<90° for the sp³donor—sp³ acceptor case, and, Φ>90° for the sp³ donor—sp² acceptor case,preferably, no switching functions are used. Template donors andacceptors that are involved in template-template hydrogen bonds arepreferably not included in the donor and acceptor lists. For the purposeof exclusion, a template-template hydrogen bond is considered to existwhen 2.5 Å<=R<=3.3 Å and θ>=135°.

The hydrogen-bond potential may also be combined or used with a weakcoulombic term that includes a distance-dependent dielectric constant of40R, where R is the interatomic distance. Partial atomic charges arepreferably only applied to polar functional groups. A net formal chargeof +1 is used for Arg and Lys and a net formal charge of −1 is used forAsp and Glu; see Gasteiger, et al., Tetrahedron 36:3219–3288 (1980);Rappe, et al., J. Phys. Chem. 95:3358–3363 (1991).

In a preferred embodiment, an explicit penalty is given for buried polarhydrogen atoms which are not hydrogen bonded to another atom. SeeEisenberg, et al., (1986) (supra), hereby expressly incorporated byreference. In a preferred embodiment, this penalty for polar hydrogenburial, is from about 0 to about 3 kcal/mol, with from about 1 to about3 being preferred and 2 kcal/mol being particularly preferred. Thispenalty is only applied to buried polar hydrogens not involved inhydrogen bonds. A hydrogen bond is considered to exist when E_(HB)ranges from about 1 to about 4 kcal/mol, with E_(HB) of less than −2kcal/mol being preferred. In addition, in a preferred embodiment, thepenalty is not applied to template hydrogens, i.e. unpaired buriedhydrogens of the backbone.

In a preferred embodiment, only hydrogen bonds between a first rotamerand the backbone are scored, and rotamer-rotamer hydrogen bonds are notscored. In an alternative embodiment, hydrogen bonds between a firstrotamer and the backbone are scored, and rotamer-rotamer hydrogen bondsare scaled by 0.5.

In a preferred embodiment, the hydrogen bonding scoring function is usedfor all positions, including core, surface and boundary positions. Inalternate embodiments, the hydrogen bonding scoring function may be usedon only one or two of these.

In a preferred embodiment, a secondary structure propensity scoringfunction is used. This is based on the specific amino acid side chain,and is conformation independent. That is, each amino acid has a certainpropensity to take on a secondary structure, either α-helix or β-sheet,based on its Φ and ψ angles. See Munoz et al., Current Op. in Biotech.6:382 (1995); Minor, et al., Nature 367:660–663 (1994); Padmanabhan, etal., Nature 344:268–270 (1990); Munoz, et al., Folding & Design1(3):167–178 (1996); and Chakrabartty, et al., Protein Sci. 3:843(1994), all of which are expressly incorporated herein by reference.Thus, for variable residue positions that are in recognizable secondarystructure in the backbone, a secondary structure propensity scoringfunction is preferably used. That is, when a variable residue positionis in an α-helical area of the backbone, the α-helical propensityscoring function described below is calculated. Whether or not aposition is in an α-helical area of the backbone is determined as willbe appreciated by those in the art, generally on the basis of Φ and ψangles; for α-helix, Φ angles from −2 to −70 and ψ angles from −30 to−100 generally describe an α-helical area of the backbone.

Similarly, when a variable residue position is in a α-sheet backboneconformation, the β-sheet propensity scoring function is used. β-sheetbackbone conformation is generally described by Φ angles from −30 to−100 and χ angles from +40 to +180. In alternate preferred embodiments,variable residue positions which are within areas of the backbone whichare not assignable to either β-sheet or .alpha.-helix structure may alsobe subjected to secondary structure propensity calculations.

In a preferred embodiment, energies associated with secondarypropensities are calculated using Equation 14:E=10^(Nss(ΔG°aa−ΔG°ala))−1  (Equation 14)

In Equation 14, Eα (or Eβ) is the energy of α-helical propensity,ΔG°_(aa) is the standard free energy of helix propagation of the aminoacid, and ΔG°_(ala) is the standard free energy of helix propagation ofalanine used as a standard, or standard free energy of β-sheet formationof the amino acid, both of which are available in the literature (seeChakrabartty, et al., (1994) (supra), and Munoz, et al., Folding &Design 1(3):167–178 (1996)), both of which are expressly incorporatedherein by reference), and N_(ss) is the propensity scale factor which isset to range from 1 to 4, with 3.0 being preferred. This potential ispreferably selected in order to scale the propensity energies to asimilar range as the other terms in the scoring function.

In a preferred embodiment, β-sheet propensities are preferablycalculated only where the i−1 and i+1 residues are also in β-sheetconformation.

In a preferred embodiment, the secondary structure propensity scoringfunction is used only in the energy calculations for surface variableresidue positions. In alternate embodiments, the secondary structurepropensity scoring function is used in the calculations for core andboundary regions as well.

In a preferred embodiment, an electrostatic scoring function is used, asshown below in Equation 15:E _(elec) =qq′/εr ²  (Equation 15)

In this Equation, q is the charge on atom 1, q′ is charge on atom 2, andr is the interaction distance.

In a preferred embodiment, at least one scoring function is used foreach variable residue position; in preferred embodiments, two, three orfour scoring functions are used for each variable residue position.

Once the scoring functions to be used are identified for each variableposition, the preferred first step in the computational analysiscomprises the determination of the interaction of each possible rotamerwith all or part of the remainder of the protein. That is, the energy ofinteraction, as measured by one or more of the scoring functions, ofeach possible rotamer at each variable residue position with either thebackbone or other rotamers, is calculated. In a preferred embodiment,the interaction of each rotamer with the entire remainder of theprotein, i.e. both the entire template and all other rotamers, is done.However, as outlined above, it is possible to only model a portion of aprotein, for example a domain of a larger protein, and thus in somecases, not all of the protein need be considered.

In a preferred embodiment, the first step of the computationalprocessing is done by calculating two sets of interactions for eachrotamer at every position: the interaction of the rotamer side chainwith the template or backbone (the “singles” energy), and theinteraction of the rotamer side chain with all other possible rotamersat every other position (the “doubles” energy), whether that position isvaried or floated. It should be understood that the backbone in thiscase includes both the atoms of the protein structure backbone, as wellas the atoms of any fixed residues, wherein the fixed residues aredefined as a particular conformation of an amino acid or analogbackbone.

Thus, “singles” (rotamer/template) energies are calculated for theinteraction of every possible rotamer at every variable residue positionwith the backbone, using some or all of the scoring functions. Thus, forthe hydrogen bonding scoring function, every hydrogen bonding atom ofthe rotamer and every hydrogen bonding atom of the backbone isevaluated, and the E_(HB) is calculated for each possible rotamer atevery variable position. Similarly, for the van der Waals scoringfunction, every atom of the rotamer is compared to every atom of thetemplate (generally excluding the backbone atoms of its own residue),and the E_(vdW) is calculated for each possible rotamer at everyvariable residue position. In addition, generally no van der Waalsenergy is calculated if the atoms are connected by three bonds or less.For the atomic solvation scoring function, the surface of the rotamer ismeasured against the surface of the template, and the E_(as) for eachpossible rotamer at every variable residue position is calculated. Thesecondary structure propensity scoring function is also considered as asingles energy, and thus the total singles energy may contain an E_(ss)term. As will be appreciated by those in the art, many of these energyterms will be close to zero, depending on the physical distance betweenthe rotamer and the template position; that is, the farther apart thetwo moieties, the lower the energy.

Accordingly, as outlined above, the total singles energy is the sum ofthe energy of each scoring function used at a particular position, asshown in Equation 1, wherein n is either 1 or zero, depending on whetherthat particular scoring function was used at the rotamer position:E _(total) =nE _(vdw) +nE _(as) +nE _(h-bonding) +nE _(ss) +nE_(elec)  (Equation 1)

Once calculated, each singles E_(total) for each possible rotamer isstored, such that it may be used in subsequent calculations, as outlinedbelow.

For the calculation of “doubles” energy (rotamer/rotamer), theinteraction energy of each possible rotamer is compared with everypossible rotamer at all other variable residue positions. Thus,“doubles” energies are calculated for the interaction of every possiblerotamer at every variable residue position with every possible rotamerat every other variable residue position, using some or all of thescoring functions. Thus, for the hydrogen bonding scoring function,every hydrogen bonding atom of the first rotamer and every hydrogenbonding atom of every possible second rotamer is evaluated, and theE_(EB) is calculated for each possible rotamer pair for any two variablepositions. Similarly, for the van der Waals scoring function, every atomof the first rotamer is compared to every atom of every possible secondrotamer, and the E_(vdW) is calculated for each possible rotamer pair atevery two variable residue positions. For the atomic solvation scoringfunction, the surface of the first rotamer is measured against thesurface of every possible second rotamer, and the E_(as) for eachpossible rotamer pair at every two variable residue positions iscalculated. The secondary structure propensity scoring function need notbe run as a “doubles” energy, as it is considered as a component of the“singles” energy. As will be appreciated by those in the art, many ofthese double energy terms will be close to zero, depending on thephysical distance between the first rotamer and the second rotamer; thatis, the farther apart the two moieties, the lower the energy.

Accordingly, as outlined above, the total doubles energy is the sum ofthe energy of each scoring function used to evaluate every possible pairof rotamers, as shown in Equation 16, wherein n is either 1 or zero,depending on whether that particular scoring function was used at therotamer position:E _(total) =nE _(vdw) +nE _(as) +nE _(h-bonding) +E _(elec)  (Equation16)

An example is illuminating. A first variable position, i, has three (anunrealistically low number) possible rotamers (which may be either froma single amino acid or different amino acids) which are labelled ia, ib,and ic. A second variable position, j, also has three possible rotamers,labelled jd, je, and jf. Thus, nine doubles energies (E_(total)) arecalculated in all: E_(total) (ia, jd), E_(total) (ia, je), Et_(total)(ia, jf), E_(total) (ib, jd), E_(total) (ib, je), E_(total) (ib, Jf),E_(total) (ic, je) and Et_(total) (ic, if).

Once calculated, each doubles E_(total) for each possible rotamer pairis stored, such that it may be used in subsequent calculations, asoutlined below.

Once the singles and doubles energies are calculated and stored, thenext step of the computational processing may occur. Generally speaking,the goal of the computational processing is to determine a set ofoptimized protein sequences. By “optimized protein sequence” herein ismeant a sequence that best fits the mathematical equations herein. Aswill be appreciated by those in the art, a global optimized sequence isthe one sequence that best fits Equation 1, i.e. the sequence that hasthe lowest energy of any possible sequence. However, there are anynumber of sequences that are not the global minimum but that have lowenergies.

In a preferred embodiment, the set comprises the globally optimalsequence in its optimal conformation, i.e. the optimum rotamer at eachvariable position. That is, computational processing is run until thesimulation program converges on a single sequence which is the globaloptimum.

In a preferred embodiment, the set comprises at least two optimizedprotein sequences. Thus for example, the computational processing stepmay eliminate a number of disfavored combinations but be stopped priorto convergence, providing a set of sequences of which the global optimumis one. In addition, further computational analysis, for example using adifferent method, may be run on the set, to further eliminate sequencesor rank them differently. Alternatively, as is more fully describedbelow, the global optimum may be reached, and then further computationalprocessing may occur, which generates additional optimized sequences inthe neighborhood of the global optimum.

If a set comprising more than one optimized protein sequences isgenerated, they may be rank ordered in terms of theoretical quantitativestability, as is more fully described below.

In a preferred embodiment, the computational processing step firstcomprises an elimination step, sometimes referred to as “applying acutoff”, either a singles elimination or a doubles elimination. Singleselimination comprises the elimination of all rotamers with templateinteraction energies of greater than about 10 kcal/mol prior to anycomputation, with elimination energies of greater than about 15 kcal/molbeing preferred and greater than about 25 kcal/mol being especiallypreferred. Similarly, doubles elimination is done when a rotamer hasinteraction energies greater than about 10 kcal/mol with all rotamers ata second residue position, with energies greater than about 15 beingpreferred and greater than about 25 kcal/mol being especially preferred.

In a preferred embodiment, the computational processing comprises directdetermination of total sequence energies, followed by comparison of thetotal sequence energies to ascertain the global optimum and rank orderthe other possible sequences, if desired. The energy of a total sequenceis shown below in Equation 17:

$\begin{matrix}{E_{{total}\mspace{14mu}{protein}} = {{E\left( {b - b} \right)} + {\sum\limits_{all\_ i}^{\;}\; E_{({ia})}} + {\sum\limits_{{all\_ i},{j\_ pairs}}^{\;}{\sum E_{({{ia},{ja}})}}}}} & \left( {{Equation}\mspace{14mu} 17} \right)\end{matrix}$

Thus every possible combination of rotamers may be directly evaluated byadding the backbone-backbone (sometimes referred to herein astemplate-template) energy (E_((b-b)) which is constant over allsequences herein since the backbone is kept constant), the singlesenergy for each rotamer (which has already been calculated and stored),and the doubles energy for each rotamer pair (which has already beencalculated and stored). Each total sequence energy of each possiblerotamer sequence can then be ranked, either from best to worst or worstto best. This is obviously computationally expensive and becomesunwieldy as the length of the protein increases.

In a preferred embodiment, the computational processing includes one ormore Dead-End Elimination (DEE) computational steps. The DEE theorem isthe basis for a very fast discrete search program that was designed topack protein side chains on a fixed backbone with a known sequence. SeeDesmet, et al., Nature 356:539–542 (1992); Desmet, et al., The ProteinsFolding Problem and Tertiary Structure Prediction, Ch. 10:1–49 (1994);Goldstein, Biophys. Jour. 66:1335–1340 (1994), all of which areincorporated herein by reference. DEE is based on the observation thatif a rotamer can be eliminated from consideration at a particularposition, i.e. make a determination that a particular rotamer isdefinitely not part of the global optimal conformation, the size of thesearch is reduced. This is done by comparing the worst interaction (i.e.energy or E_(total)) of a first rotamer at a single variable positionwith the best interaction of a second rotamer at the same variableposition. If the worst interaction of the first rotamer is still betterthan the best interaction of the second rotamer, then the second rotamercannot possibly be in the optimal conformation of the sequence. Theoriginal DEE theorem is shown in Equation 18:E(ia)+Σ[min(j)over t{E(ia,jt)}]>E(ib)+[max(j)overt{E(ib,jt)}]  (Equation 18)

In Equation 18, rotamer ia is being compared to rotamer ib. The leftside of the inequality is the best possible interaction energy(E_(total)) of ia with the rest of the protein; that is, “min over t”means find the rotamer t on position j that has the best interactionwith rotamer ia. Similarly, the right side of the inequality is theworst possible (max) interaction energy of rotamer ib with the rest ofthe protein. If this inequality is true, then rotamer ia is Dead-Endingand can be Eliminated. The speed of DEE comes from the fact that thetheorem only requires sums over the sequence length to test andeliminate rotamers.

In a preferred embodiment, a variation of DEE is performed. GoldsteinDEE, based on Goldstein, (1994) (supra), hereby expressly incorporatedby reference, is a variation of the DEE computation, as shown inEquation 19:E(ia)−E(ib)+Σ[min over t{E(ia,jt)−E(ib,jt)}]>0  (Equation 19)

In essence, the Goldstein Equation 19 says that a first rotamer a of aparticular position i (rotamer ia) will not contribute to a local energyminimum if the energy of conformation with ia can always be lowered byjust changing the rotamer at that position to ib, keeping the otherresidues equal. If this inequality is true, then rotamer ia isDead-Ending and can be Eliminated.

Thus, in a preferred embodiment, a first DEE computation is done whererotamers at a single variable position are compared, (“singles” DEE) toeliminate rotamers at a single position. This analysis is repeated forevery variable position, to eliminate as many single rotamers aspossible. In addition, every time a rotamer is eliminated fromconsideration through DEE, the minimum and maximum calculations ofEquation 18 or 19 change, depending on which DEE variation is used, thusconceivably allowing the elimination of further rotamers. Accordingly,the singles DEE computation can be repeated until no more rotamers canbe eliminated; that is, when the inequality is not longer true such thatall of them could conceivably be found on the global optimum.

In a preferred embodiment, “doubles” DEE is additionally done. Indoubles DEE, pairs of rotamers are evaluated; that is, a first rotamerat a first position and a second rotamer at a second position arecompared to a third rotamer at the first position and a fourth rotamerat the second position, either using original or Goldstein DEE. Pairsare then flagged as nonallowable, although single rotamers cannot beeliminated, only the pair. Again, as for singles DEE, every time arotamer pair is flagged as nonallowable, the minimum calculations ofEquation 18 or 19 change (depending on which DEE variation is used) thusconceivably allowing the flagging of further rotamer pairs. Accordingly,the doubles DEE computation can be repeated until no more rotamer pairscan be flagged; that is, where the energy of rotamer pairs overlap suchthat all of them could conceivably be found on the global optimum.

In addition, in a preferred embodiment, rotamer pairs are initiallyprescreened to eliminate rotamer pairs prior to DEE. This is done bydoing relatively computationally inexpensive calculations to eliminatecertain pairs up front. This may be done in several ways, as is outlinedbelow.

In a preferred embodiment, the rotamer pair with the lowest interactionenergy with the rest of the system is found. Inspection of the energydistributions in sample matrices has revealed that an i_(u) j_(v) pairthat dead-end eliminates a particular i_(r) j_(s) pair can alsoeliminate other i_(r) j_(s) pairs. In fact, there are often a few i_(u)j_(v) pairs, which we call “magic bullets,” that eliminate a significantnumber of i_(r) j_(s) pairs. We have found that one of the most potentmagic bullets is the pair for which maximum interaction energy,t_(max)([i_(u) j_(v)])k_(t), is least. This pair is referred to as[i_(u) j_(v)]mb If this rotamer pair is used in the first round ofdoubles DEE, it tends to eliminate pairs faster.

Our first speed enhancement is to evaluate the first-order doublescalculation for only the matrix elements in the row corresponding to the[i_(u) j_(v)]_(mb) pair. The discovery of [i_(u) j_(v)]_(mb) is an n²calculation (n=the number of rotamers per position), and the applicationof Equation 19 to the single row of the matrix corresponding to thisrotamer pair is another n² calculation, so the calculation time is smallin comparison to a full first-order doubles calculation. In practice,this calculation produces a large number of dead-ending pairs, oftenenough to proceed to the next iteration of singles elimination withoutany further searching of the doubles matrix.

The magic bullet first-order calculation will also discover alldead-ending pairs that would be discovered by the Equation 18 or 19,thereby making it unnecessary. This stems from the fact that.epsilon._(max) ([i_(u) ij_(v)]_(mb)) must be less than or equal to any.epsilon._(max) ([i_(u) j_(v)]) that would successfully eliminate a pairby the Equation 18 or 19.

Since the minima and maxima of any given pair has been precalculated asoutlined herein, a second speed-enhancement precalculation may be done.By comparing extrema, pairs that will not dead end can be identified andthus skipped, reducing the time of the DEE calculation. Thus, pairs thatsatisfy either one of the following criteria are skipped:ε_(min)([i _(r) j _(s)])<ε_(min)([i _(u) j _(v)])  (Equation 20)ε_(min)([i _(r) j _(s)])<ε_(min)([i _(u) j _(v)])  (Equation 21)

Because the matrix containing these calculations is symmetrical, half ofits elements will satisfy the first inequality Equation 20, and half ofthose remaining will satisfy the other inequality Equation 21. Thesethree quarters of the matrix need not be subjected to the evaluation ofEquation 18 or 19, resulting in a theoretical speed enhancement of afactor of four.

The last DEE speed enhancement refines the search of the remainingquarter of the matrix. This is done by constructing a metric from theprecomputed extrema to detect those matrix elements likely to result ina dead-ending pair.

A metric was found through analysis of matrices from different sampleoptimizations. We searched for combinations of the extrema thatpredicted the likelihood that a matrix element would produce adead-ending pair. Interval sizes for each pair were computed fromdifferences of the extrema. The size of the overlap of the i_(r) j_(s)and i_(u) j_(v) intervals were also computed, as well as the differencebetween the minima and the difference between the maxima. Combinationsof these quantities, as well as the lone extrema, were tested for theirability to predict the occurrence of dead-ending pairs. Because some ofthe maxima were very large, the quantities were also comparedlogarithmically.

Most of the combinations were able to predict dead-ending matrixelements to varying degrees. The best metrics were the fractionalinterval overlap with respect to each pair, referred to herein as q_(rs)and q_(uv).

$\begin{matrix}\begin{matrix}{q_{rs} = {{interval}\mspace{14mu}{{overlap}/{interval}}\;\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)}} \\{= {\left\{ {{ɛ_{\max}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)}} \right\}/\left\{ {{ɛ_{\max}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)}} \right\}}}\end{matrix} & \text{(Equation~~22)} \\\begin{matrix}{q_{uv} = {{interval}\mspace{14mu}{{overlap}/{interval}}\;\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)}} \\{= {\left\{ {{ɛ_{\max}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)}} \right\}/\left\{ {{ɛ_{\max}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} - {ɛ_{\min}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)}} \right\}}}\end{matrix} & \text{(Equation~~23)}\end{matrix}$

These values are calculated using the minima and maxima equations 24,25, 26 and 27:

$\begin{matrix}{{ɛ_{\max}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{{\max(1)}{ɛ\left( {\left\lbrack {i_{r}j_{s}} \right\rbrack,k} \right)}}}}} & \text{(Equation~~24)} \\{{ɛ_{\min}\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{r}j_{s}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{{\min(t)}{ɛ\left( {\left\lbrack {i_{r}j_{s}} \right\rbrack,k_{1}} \right)}}}}} & \text{(Equation~~25)} \\{{ɛ_{\max}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{{\max(t)}{ɛ\left( {\left\lbrack {i_{u}j_{v}} \right\rbrack,k_{t}} \right)}}}}} & \text{(Equation~~26)} \\{{ɛ_{\min}\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} = {{ɛ\left( \left\lbrack {i_{u}j_{v}} \right\rbrack \right)} + {\sum\limits_{k \neq i \neq j}{{\min(t)}{ɛ\left( {\left\lbrack {i_{u}j_{v}} \right\rbrack,k_{1}} \right)}}}}} & \text{(Equation~~27)}\end{matrix}$

These metrics were selected because they yield ratios of the occurrenceof dead-ending matrix elements to the total occurrence of elements thatare higher than any of the other metrics tested. For example, there arevery few matrix elements (˜2%) for which q_(rs)>0.98, yet these elementsproduce 30–40% of all of the dead-ending pairs.

Accordingly, the first-order doubles criterion is applied only to thosedoubles for which q_(rs)>0.98 and q_(uv)>0.99. The sample data analysespredict that by using these two metrics, as many as half of thedead-ending elements may be found by evaluating only two to five percentof the reduced matrix.

Generally, as is more fully described below, single and double DEE,using either or both of original DEE and Goldstein DEE, is run until nofurther elimination is possible. Usually, convergence is not complete,and further elimination must occur to achieve convergence. This isgenerally done using “super residue” DEE.

In a preferred embodiment, additional DEE computation is done by thecreation of “super residues” or “unification”, as is generally describedin Desmet, Nature 356:539–542 (1992); Desmet, et al., The ProteinFolding Problem and Tertiary Structure Prediction, Ch. 10:1–49 (1994);Goldstein, et al., supra. A super residue is a combination of two ormore variable residue positions which is then treated as a singleresidue position. The super residue is then evaluated in singles DEE,and doubles DEE, with either other residue positions or super residues.The disadvantage of super residues is that there are many more rotamericstates which must be evaluated; that is, if a first variable residueposition has 5 possible rotamers, and a second variable residue positionhas 4 possible rotamers, there are 20 possible super residue rotamerswhich must be evaluated. However, these super residues may be eliminatedsimilar to singles, rather than being flagged like pairs.

The selection of which positions to combine into super residues may bedone in a variety of ways. In general, random selection of positions forsuper residues results in inefficient elimination, but it can be done,although this is not preferred. In a preferred embodiment, the firstevaluation is the selection of positions for a super residue is thenumber of rotamers at the position. If the position has too manyrotamers, it is never unified into a super residue, as the computationbecomes too unwieldy. Thus, only positions with fewer than about 100,000rotamers are chosen, with less than about 50,000 being preferred andless than about 10,000 being especially preferred.

In a preferred embodiment, the evaluation of whether to form a superresidue is done as follows. All possible rotamer pairs are ranked usingEquation 28, and the rotamer pair with the highest number is chosen forunification:Fraction of flaggedpairs/log^((number of super rotamers resulting from the potential unification))  (Equation28)

Equation 28 is looking for the pair of positions that has the highestfraction or percentage of flagged pairs but the fewest number of superrotamers. That is, the pair that gives the highest value for Equation 28is preferably chosen. Thus, if the pair of positions that has thehighest number of flagged pairs but also a very large number of superrotamers (that is, the number of rotamers at position i times the numberof rotamers at position j), this pair may not be chosen (although itcould) over a lower percentage of flagged pairs but fewer superrotamers.

In an alternate preferred embodiment, positions are chosen for superresidues that have the highest average energy; that is, for positions iand j, the average energy of all rotamers for i and all rotamers for jis calculated, and the pair with the highest average energy is chosen asa super residue.

Super residues are made one at a time, preferably. After a super residueis chosen, the singles and doubles DEE computations are repeated wherethe super residue is treated as if it were a regular residue. As forsingles and doubles DEE, the elimination of rotamers in the superresidue DEE will alter the minimum energy calculations of DEE. Thus,repeating singles and/or doubles DEE can result in further eliminationof rotamers.

In summary, the calculation and storage of the singles and doublesenergies is the first step, although these may be recalculated everytime. This is followed by the optional application of a cutoff, wheresingles or doubles energies that are too high are eliminated prior tofurther processing. Either or both of original singles DEE or Goldsteinsingles DEE may be done, with the elimination of original singles DEEbeing generally preferred. Once the singles DEE is run, original doublesand/or Goldstein doubles DEE is run. Super residue DEE is then generallyrun, either original or Goldstein super residue DEE. This preferablyresults in convergence at a global optimum sequence. After any step anyor all of the previous steps can be rerun, in any order.

The addition of super residue DEE to the computational processing, withrepetition of the previous DEE steps, generally results in convergenceat the global optimum. Convergence to the global optimum is guaranteedif no cutoff applications are made, although generally a global optimumis achieved even with these steps. In a preferred embodiment, DEE is rununtil the global optimum sequence is found. That is, the set ofoptimized protein sequences contains a single member, the globaloptimum.

In a preferred embodiment, the various DEE steps are run until amanagable number of sequences is found, i.e. no further processing isrequired. These sequences represent a set of optimized proteinsequences, and they can be evaluated as is more fully described below.Generally, for computational purposes, a manageable number of sequencesdepends on the length of the sequence, but generally ranges from about 1to about 10¹⁵ possible rotamer sequences.

Alternatively, DEE is run to a point, resulting in a set of optimizedsequences (in this context, a set of remainder sequences) and thenfurther compututational processing of a different type may be run. Forexample, in one embodiment, direct calculation of sequence energy asoutlined above is done on the remainder possible sequences.Alternatively, a Monte Carlo search can be run.

In a preferred embodiment, the computation processing need not comprisea DEE computational step. In this embodiment, a Monte Carlo search isundertaken, as is known in the art. See Metropolis et al., J. Chem.Phys. 21:1087 (1953), hereby incorporated by reference. In thisembodiment, a random sequence comprising random rotamers is chosen as astart point. In one embodiment, the variable residue positions areclassified as core, boundary or surface residues and the set ofavailable residues at each position is thus defined. Then a randomsequence is generated, and a random rotamer for each amino acid ischosen. This serves as the starting sequence of the Monte Carlo search.A Monte Carlo search then makes a random jump at one position, either toa different rotamer of the same amino acid or a rotamer of a differentamino acid, and then a new sequence energy (E_(total) sequence) iscalculated, and if the new sequence energy meets the Boltzmann criteriafor acceptance, it is used as the starting point for another jump. Ifthe Boltzmann test fails, another random jump is attempted from theprevious sequence. In this way, sequences with lower and lower energiesare found, to generate a set of low energy sequences.

If computational processing results in a single global optimum sequence,it is frequently preferred to generate additional sequences in theenergy neighborhood of the global solution, which may be ranked. Theseadditional sequences are also optimized protein sequences. Thegeneration of additional optimized sequences is generally preferred soas to evaluate the differences between the theoretical and actualenergies of a sequence. Generally, in a preferred embodiment, the set ofsequences is at least about 75% homologous to each other, with at leastabout 80% homologous being preferred, at least about 85% homologousbeing particularly preferred, and at least about 90% being especiallypreferred. In some cases, homology as high as 95% to 98% is desirable.Homology in this context means sequence similarity or identity, withidentity being preferred. Identical in this context means identicalamino acids at corresponding positions in the two sequences which arebeing compared. Homology in this context includes amino acids which areidentical and those which are similar (functionally equivalent). Thishomology will be determined using standard techniques known in the art,such as the Best Fit sequence program described by Devereux, et al.,Nucl. Acid Res., 12:387–395 (1984), or the BLASTX program (Altschul, etal., J. Mol. Biol., 215:403–410 (1990)) preferably using the defaultsettings for either. The alignment may include the introduction of gapsin the sequences to be aligned. In addition, for sequences which containeither more or fewer amino acids than an optimum sequence, it isunderstood that the percentage of homology will be determined based onthe number of homologous amino acids in relation to the total number ofamino acids. Thus, for example, homology of sequences shorter than anoptimum will be determined using the number of amino acids in theshorter sequence.

Once optimized protein sequences are identified, the processingoptionally proceeds to a step which entails searching the proteinsequences. This processing may be implemented with a set of computercode that executes a search strategy. For example, the search mayinclude a Monte Carlo search as described above. Starting with theglobal solution, random positions are changed to other rotamers allowedat the particular position, both rotamers from the same amino acid androtamers from different amino acids. A new sequence energy (E_(total)sequence) is calculated, and if the new sequence energy meets theBoltzmann criteria for acceptance, it is used as the starting point foranother jump. See Metropolis et al., 1953, supra, hereby incorporated byreference. If the Boltzmann test fails, another random jump is attemptedfrom the previous sequence. A list of the sequences and their energiesis maintained during the search. After a predetermined number of jumps,the best scoring sequences may be output as a rank-ordered list.Preferably, at least about 10⁶ jumps are made, with at least about 10⁷jumps being preferred and at least about 10⁸ jumps being particularlypreferred. Preferably, at least about 100 to 1000 sequences are saved,with at least about 10,000 sequences being preferred and at least about100,000 to 1,000,000 sequences being especially preferred. During thesearch, the temperature is preferably set to 1000 K.

Once the Monte Carlo search is over, all of the saved sequences arequenched by changing the temperature to 0 K, and fixing the amino acididentity at each position. Preferably, every possible rotamer jump forthat particular amino acid at every position is then tried.

The computational processing results in a set of optimized proteinsequences that are best suited to bind the intended amino acid analog.These optimized protein sequences may be significantly different fromthe wild-type sequence from which the backbone was taken. That is, eachoptimized protein sequence may comprises at least one residue change, orat least about 1–2%, 2–5%, 5–10% or more variant amino acids from thestarting or wild-type sequence.

In a preferred embodiment, one, some or all of the optimized AARSsequences are constructed into designed proteins. Thereafter, the AARSsequences can be tested for their ability, specificity and/or efficiencyto incorporate amino acid analogs into proteins in in vitro and/or invivo assay. Generally, this can be done in one of two ways.

In a preferred embodiment, the designed AARS may be chemicallysynthesized as is known in the art. This is particularly useful when thedesigned AARSs or functional fragments are short, preferably less than150 amino acids in length, with less than 100 amino acids beingpreferred, and less than 50 amino acids being particularly preferred,although as is known in the art, longer proteins can be made chemicallyor enzymatically.

In a preferred embodiment, particularly for longer AARS proteins orproteins for which large samples are desired, the optimized sequence isused to create a nucleic acid such as DNA which encodes the optimizedsequence and which can then be cloned into a host cell and expressed.Thus, nucleic acids, and particularly DNA, can be made which encodeseach optimized AARS sequence. This is done using well known procedures.The choice of codons, suitable expression vectors and suitable hostcells will vary depending on a number of factors, and can be easilyoptimized as needed.

Once made, the designed AARS proteins are experimentally evaluated andtested for structure, altered function and stability, as necessary. Thiswill be done as is known in the art, and will depend in part on theoriginal protein from which the protein backbone structure was taken.Preferably, the designed AARS proteins are more stable than the knownAARS protein that was used as the starting point, although in somecases, if some constaints are placed on the methods, the designedprotein may be less stable. Thus, for example, it is possible to fixcertain residues for altered biological activity and find the moststable sequence, but it may still be less stable than the wild typeprotein. Stable in this context means that the new protein retainseither biological activity or conformation past the point at which theparent molecule did. Stability includes, but is not limited to, thermalstability, i.e. an increase in the temperature at which reversible orirreversible denaturing starts to occur; proteolytic stability, i.e. adecrease in the amount of protein which is irreversibly cleaved in thepresence of a particular protease (including autolysis); stability toalterations in pH or oxidative conditions; chelator stability; stabilityto metal ions; stability to solvents such as organic solvents,surfactants, formulation chemicals; etc.

In a preferred embodiment, the modelled AARS proteins are at least about5% more stable than the original protein, with at least about 10% beingpreferred and at least about 20–50% being especially preferred.

The results of the testing operations may be computationally assessed.That is, computer code may be prepared to analyze the test data withrespect to any number of metrices.

At this processing juncture, if the AARS protein is selected then theprotein is utilized. If a protein is not selected, the accumulatedinformation may be used to alter the ranking scheme discussed above,and/or to repeat searching for more sequences.

In a preferred embodiment, the experimental results are used for designfeedback and design optimization.

Once made, the AARS proteins of the invention find use in a wide varietyof applications (see exemplary uses below), as will be appreciated bythose in the art. Similarly, the methods of the present invention allowthe generation of useful pharmaceutical proteins, such as analogs ofknown proteinaceous drugs which are more thermostable, lessproteolytically sensitive, or contain other desirable changes.

In still other embodiments, an AARS sequence may be compared to one ormore (preferably to a plurality of) AARS sequences to identifyhomologous sequences. For example, in one preferred embodiment, aparticular AARS protein sequence may be aligned with a plurality ofother AARS sequences (e.g., from a database of naturally occurringprotein or other sequences, such as the GenBank, SWISPROT or EMBLdatabase) to identify homologous sequences. Sequences that have acertain level of sequence similarity may also be compared. Generally,the level of sequence similarity is a threshold level or percentage ofsequence homology (or sequence identity) that may be selected by a user.For example, preferred levels of sequence homology (or identity) are atleast 55%, at least 60%, at least 65%, at least 70%, at least 75%, atleast 80%, at least 85%, at least 90%, at least 95% or at least 99%. Avariety of methods and algorithms are known in the art for aligningsequences and/or determining their levels of sequence similarity. Any ofthese methods and algorithms may be used in connection with thisinvention. Exemplary algorithms include, but are not limited to, theBLAST family of algorithms, FASTA, MEGALIGN and CLUSTAL.

Once homologous sequences have been identified and/or aligned with theparent AARS sequence, the site entropy of one or more particularresidues in the parent AARS sequence may be determined or estimated fromthe number of homologous or aligned sequences in which the particularresidue is mutated. As used to describe this invention, a homologous oraligned sequence is said to have a mutation at a particular residue if,in an alignment of the particular sequence (e.g., from the alignmentalgorithm used to identify a homologous sequence or sequences), theresidue in the homologous sequence which that aligns with the particularresidue in the parent sequence is different (i.e., has a differentidentity) from the particular residue. The probabilities required todetermine the site entropy can be calculated by the relative number oftimes each amino acid appears at each residue in the alignment.

F. RBIAS—for Enhancing Protein-substrate Interactions

All the mutant sequences predicted in the above calculation gear towardsmaking enough space in the binding pocket for accommodating the analog,but besides van der Waals interactions, there may be little or nospecific interaction between the analog and the synthetase.

In order to design specificity into the AARS-analog interaction, theprogram RBIAS or its equivalent may be used to enhance the interactionsbetween the analog and the protein positions. This is done by scaling upthe pair-wise energies between the analog and the amino acids allowed atthe design positions on the protein in the energy calculations. In anoptimization calculation where the protein-analog interactions arescaled up compared to the intra-protein interactions, sequence selectionwill be biased toward selecting amino acids to be those that havefavorable interaction with the analog.

Multiple calculations can be performed by scaling up the analog-proteininteractions by factors of, for example, 2.0 to 20.0, in increments of,for example, 0.5, 1.5, or preferably 2.0. Certain scale factors areexpected to generate certain interesting mutations, which, for example,would makes a hydrogen bond with a group of the analog. The interactionbetween the analog and the AARS in the sequence may be enhanced by avalue less than the value lost due to destabilization of the overallstructure, as indicated by the total energy. Moreover, polar groups inthe core, especially those that are not involved in a salt-bridge or ahydrogen bond may significantly destabilize proteins. Therefore, onlysome amount of overall protein stability can be traded off in order togain specificity between the protein and the analog. However, based onthese considerations, certain candidate sequence mutations orcombinations thereof will emerge, and these potential sequence changesare good candidates for further testing, such as constructing the mutantprotein using standard recombinant DNA technology, and testing thesemutants in vitro and/or in vivo in order to assess the ability,specificity, and/or efficiency of these mutants to incorporate theanalogs into protein products. Preferably, the re-designed AARS isspecific for the analog it is intended to charge onto tRNA, and cannotcharge natural amino acids to tRNA.

G. Methods for Predicting 3D Structure Based on Sequence Homology

For AARS proteins that have not been crystallized or been the focus ofother structural determinations, a computer-generated molecular model ofthe AARS and its binding site can nevertheless be generated using any ofa number of techniques available in the art. For example, the Cα-carbonpositions of the target AARS sequence can be mapped to a particularcoordinate pattern of an AARS enzyme (“known AARS”) having a similarsequence and deduced structure using homology modeling techniques, andthe structure of the target protein and velocities of each atomcalculated at a simulation temperature (To) at which a dockingsimulation with an amino acid analog is to be determined. Typically,such a protocol involves primarily the prediction of side-chainconformations in the modeled target AARS protein, while assuming amain-chain trace taken from a tertiary structure, such as provided bythe known AARS protein. Computer programs for performing energyminimization routines are commonly used to generate molecular models.For example, both the CHARMM (Brooks et al. (1983) J Comput Chem4:187–217) and AMBER (Weiner et al (1981) J. Comput. Chem. 106: 765)algorithms handle all of the molecular system setup, force fieldcalculation, and analysis (see also, Eisenfield et al. (1991) Am JPhysiol 261:C376–386; Lybrand (1991) J Pharm Belg 46:49–54; Froimowitz(1990) Biotechniques 8:640–644; Burbam et al. (1990) Proteins 7:99–111;Pedersen (1985) Environ Health Perspect 61:185–190; and Kini et al.(1991) J Biomol Struct Dyn 9:475–488). At the heart of these programs isa set of subroutines that, given the position of every atom in themodel, calculate the total potential energy of the system and the forceon each atom. These programs may utilize a starting set of atomiccoordinates, the parameters for the various terms of the potentialenergy function, and a description of the molecular topology (thecovalent structure). Common features of such molecular modeling methodsinclude: provisions for handling hydrogen bonds and other constraintforces; the use of periodic boundary conditions; and provisions foroccasionally adjusting positions, velocities, or other parameters inorder to maintain or change temperature, pressure, volume, forces ofconstraint, or other externally controlled conditions.

Most conventional energy minimization methods use the input coordinatedata and the fact that the potential energy function is an explicit,differentiable function of Cartesian coordinates, to calculate thepotential energy and its gradient (which gives the force on each atom)for any set of atomic positions. This information can be used togenerate a new set of coordinates in an effort to reduce the totalpotential energy and, by repeating this process over and over, tooptimize the molecular structure under a given set of externalconditions. These energy minimization methods are routinely applied tomolecules similar to the subject AARS proteins.

In general, energy minimization methods can be carried out for a giventemperature, Ti, which may be different than the docking simulationtemperature, To. Upon energy minimization of the molecule at Ti,coordinates and velocities of all the atoms in the system are computed.Additionally, the normal modes of the system are calculated. It will beappreciated by those skilled in the art that each normal mode is acollective, periodic motion, with all parts of the system moving inphase with each other, and that the motion of the molecule is thesuperposition of all normal modes. For a given temperature, the meansquare amplitude of motion in a particular mode is inverselyproportional to the effective force constant for that mode, so that themotion of the molecule will often be dominated by the low frequencyvibrations.

After the molecular model has been energy minimized at Ti, the system is“heated” or “cooled” to the simulation temperature, To, by carrying outan equilibration run where the velocities of the atoms are scaled in astep-wise manner until the desired temperature, To, is reached. Thesystem is further equilibrated for a specified period of time untilcertain properties of the system, such as average kinetic energy, remainconstant. The coordinates and velocities of each atom are then obtainedfrom the equilibrated system.

Further energy minimization routines can also be carried out. Forexample, a second class of methods involves calculating approximatesolutions to the constrained EOM for the protein. These methods use aniterative approach to solve for the Lagrange multipliers and, typically,only need a few iterations if the corrections required are small. Themost popular method of this type, SHAKE (Ryckaert et al. (1977) J ComputPhys 23:327; and Van Gunsteren et al. (1977) Mol Phys 34:1311) is easyto implement and scales as O(N) as the number of constraints increases.Therefore, the method is applicable to macromolecules such as AARSproteins. An alternative method, RATTLE (Anderson (1983) J Comput Phys52:24) is based on the velocity version of the Verlet algorithm. LikeSHAKE, RATTLE is an iterative algorithm and can be used to energyminimize the model of a subject AARS protein.

H. Alternative Methods

In other embodiments, rather than holding the identity of the amino acidanalog constant and varying the AARS structure (by modeling severaldifferent mutant structures), the subject method is carried out usingthe molecular model(s) for a single AARS mutant (e.g., in which one morenon-anchor amino acid residues are changed) and sampling a variety ofdifferent amino acid analogs or potential fragments thereof, to identifyanalogs which are likely to interact with, and be substrates for themutant AARS enzyme. This approach can make use of coordinate librariesfor amino acid analogs (including rotamer variants) or libraries offunctional groups and spacers that can be joined to form the sidechainof an amino acid analog.

Using such approaches as described above, e.g., homology modeling, acoordinate set for the binding site for the mutant AARS can be derived.

There are a variety of computational methods that can be readily adaptedfor identifying the structure of amino acid analogs that would haveappropriate steric and electronic properties to interact with thesubstrate binding site of an AARS mutant. See, for example, Cohen et al.(1990) J. Med. Cam. 33: 883–894; Kuntz et al. (1982) J. Mol. Biol161:269–288; DesJarlais (1988) J. Med. Cam. 31: 722–729; Bartlett et al.(1989) (Spec. Publ., Roy. Soc. Chem.) 78: 182 –196; Goodford et al.(1985) J. Med. Cam. 28: 849–857; DesJarlais et al. J. Med. Cam. 29:2149–2153). Directed methods generally fall into two categories: (1)design by analogy in which 3-D structures of known molecules (such asfrom a crystallographic database) are docked to the AARS binding sitestructure and scored for goodness-of-fit; and (2) de novo design, inwhich the amino acid analog model is constructed piece-wise in the AARSbinding site. The latter approach, in particular, can facilitate thedevelopment of novel molecules, uniquely designed to bind to the subjectAARS mutant binding site.

In an illustrative embodiment, the design of potential amino acidanalogs that may function with a particular AARS mutant begins from thegeneral perspective of shape complimentary for the substrate bindingsite of the enzyme, and a search algorithm is employed which is capableof scanning a database of small molecules of known three-dimensionalstructure for candidates which fit geometrically into the substratebinding site. Such libraries can be general small molecule libraries, orcan be libraries directed to amino acid analogs or small molecules whichcan be used to create amino acid analogs. It is not expected that themolecules found in the shape search will necessarily be leadsthemselves, since no evaluation of chemical interaction necessarily bemade during the initial search. Rather, it is anticipated that suchcandidates might act as the framework for further design, providingmolecular skeletons to which appropriate atomic replacements can bemade. Of course, the chemical complimentary of these molecules can beevaluated, but it is expected that atom types will be changed tomaximize the electrostatic, hydrogen bonding, and hydrophobicinteractions with the substrate binding site. Most algorithms of thistype provide a method for finding a wide assortment of chemicalstructures that may be complementary to the shape of the AARS substratebinding site.

For instance, each of a set of small molecules from a particulardata-base, such as the Cambridge Crystallographic Data Bank (CCDB)(Allen et al. (1973) J. Chem. Doc. 13: 119), is individually docked tothe binding site of the AARS mutant in a number of geometricallypermissible orientations with use of a docking algorithm. In a preferredembodiment, a set of computer algorithms called DOCK, can be used tocharacterize the shape of invaginations and grooves that form thebinding site. See, for example, Kuntz et al. (1982) J. Mol. Biol 161:269–288. The program can also search a database of small molecules fortemplates whose shapes are complementary to particular binding site ofthe AARS mutant. Exemplary algorithms that can be adapted for thispurpose are described in, for example, DesJarlais et al. (1988) J MedChem 31:722–729.

The orientations are evaluated for goodness-of-fit and the best are keptfor further examination using molecular mechanics programs, such asAMBER or CHARMM. Such algorithms have previously proven successful infinding a variety of molecules that are complementary in shape to agiven binding site of a receptor or enzyme, and have been shown to haveseveral attractive features. First, such algorithms can retrieve aremarkable diversity of molecular architectures. Second, the beststructures have, in previous applications to other proteins,demonstrated impressive shape complementarity over an extended surfacearea. Third, the overall approach appears to be quite robust withrespect to small uncertainties in positioning of the candidate atoms.

In certain embodiments, the subject method can utilize an algorithmdescribed by Goodford (1985, J Med Chem 28:849–857) and Boobbyer et al.(1989, J Med Chem 32:1083–1094). Those papers describe a computerprogram (GRID) which seeks to determine regions of high affinity fordifferent chemical groups (termed probes) on the molecular surface ofthe binding site. GRID hence provides a tool for suggestingmodifications to known ligands that might enhance binding. It may beanticipated that some of the sites discerned by GRID as regions of highaffinity correspond to “pharmacophoric patterns” determinedinferentially from a series of known ligands. As used herein, apharmacophoric pattern is a geometric arrangement of features of theanticipated amino acid analog that is believed to be important forbinding. Goodsell and Olson (1990, Proteins: Struct Funct Genet8:195–202) have used the Metropolis (simulated annealing) algorithm todock a single known ligand into a target protein, and their approach canbe adapted for identifying suitable amino acid analogs for docking withthe AARS binding site. This algorithm can allow torsional flexibility inthe amino acid sidechain and use GRID interaction energy maps as rapidlookup tables for computing approximate interaction energies.

Yet a further embodiment of the present invention utilizes a computeralgorithm such as CLIX which searches such databases as CCDB for smallmolecules which can be oriented in the substrate binding site of theAARS in a way that is both sterically acceptable and has a highlikelihood of achieving favorable chemical interactions between thecandidate molecule and the surrounding amino acid residues. The methodis based on characterizing the substrate binding site in terms of anensemble of favorable binding positions for different chemical groupsand then searching for orientations of the candidate molecules thatcause maximum spatial coincidence of individual candidate chemicalgroups with members of the ensemble. The current availability ofcomputer power dictates that a computer-based search for novel ligandsfollows a breadth-first strategy. A breadth-first strategy aims toreduce progressively the size of the potential candidate search space bythe application of increasingly stringent criteria, as opposed to adepth-first strategy wherein a maximally detailed analysis of onecandidate is performed before proceeding to the next. CLIX conforms tothis strategy in that its analysis of binding is rudimentary—it seeks tosatisfy the necessary conditions of steric fit and of having individualgroups in “correct” places for bonding, without imposing the sufficientcondition that favorable bonding interactions actually occur. A ranked“shortlist” of molecules, in their favored orientations, is producedwhich can then be examined on a molecule-by-molecule basis, usingcomputer graphics and more sophisticated molecular modeling techniques.CLIX is also capable of suggesting changes to the substituent chemicalgroups of the candidate molecules that might enhance binding. Again, thestarting library can be of amino acid analogs or of molecules which canbe used to generate the sidechain of an amino acid analog.

The algorithmic details of CLIX is described in Lawerence et al. (1992)Proteins 12:31–41, and the CLIX algorithm can be summarized as follows.The GRID program is used to determine discrete favorable interactionpositions (termed target sites) in the binding site of the AARS proteinfor a wide variety of representative chemical groups. For each candidateligand in the CCDB an exhaustive attempt is made to make coincident, ina spatial sense in the binding site of the protein, a pair of thecandidate's substituent chemical groups with a pair of correspondingfavorable interaction sites proposed by GRID. All possible combinationsof pairs of ligand groups with pairs of GRID sites are considered duringthis procedure. Upon locating such coincidence, the program rotates thecandidate ligand about the two pairs of groups and checks for sterichindrance and coincidence of other candidate atomic groups withappropriate target sites. Particular candidate/orientation combinationsthat are good geometric fits in the binding site and show sufficientcoincidence of atomic groups with GRID sites are retained.

Consistent with the breadth-first strategy, this approach involvessimplifying assumptions. Rigid protein and small molecule geometry ismaintained throughout. As a first approximation rigid geometry isacceptable as the energy minimized coordinates of the binding site ofthe AARS mutant, describe an energy minimum for the molecule, albeit alocal one.

A further assumption implicit in CLIX is that the potential ligand, whenintroduced into the substrate binding site of the AARS mutant, does notinduce change in the protein's stereochemistry or partial chargedistribution and so alter the basis on which the GRID interaction energymaps were computed. It must also be stressed that the interaction sitespredicted by GRID are used in a positional and type sense only, i.e.,when a candidate atomic group is placed at a site predicted as favorableby GRID, no check is made to ensure that the bond geometry, the state ofprotonation, or the partial charge distribution favors a stronginteraction between the protein and that group. Such detailed analysisshould form part of more advanced modeling of candidates identified inthe CLIX shortlist.

Yet another embodiment of a computer-assisted molecular design methodfor identifying amino acid analogs that may be utilized by apredetermined AARS mutant comprises the de novo synthesis of potentialinhibitors by algorithmic connection of small molecular fragments thatwill exhibit the desired structural and electrostatic complementaritywith the substrate binding site of the enzyme. The methodology employs alarge template set of small molecules with are iteratively piecedtogether in a model of the AARS' substrate binding site. Each stage ofligand growth is evaluated according to a molecular mechanics-basedenergy function, which considers van der Waals and coulombicinteractions, internal strain energy of the lengthening ligand, anddesolvation of both ligand and enzyme. The search space can be managedby use of a data tree which is kept under control by pruning accordingto the binding criteria.

In yet another embodiment, potential amino acid analogs can bedetermined using a method based on an energy minimization-quenchedmolecular dynamics algorithm for determining energetically favorablepositions of functional groups in the substrate binding site of a mutantAARS enzyme. The method can aid in the design of molecules thatincorporate such functional groups by modification of known amino acidand amino acid analogs or through de novo synthesis.

For example, the multiple copy simultaneous search method (MCSS)described by Miranker et al. (1991) Proteins 11: 29–34 can be adaptedfor use in the subject method. To determine and characterize a localminima of a functional group in the force field of the protein, multiplecopies of selected functional groups are first distributed in a bindingsite of interest on the AARS protein. Energy minimization of thesecopies by molecular mechanics or quenched dynamics yields the distinctlocal minima. The neighborhood of these minima can then be explored by agrid search or by constrained minimization. In one embodiment, the MCSSmethod uses the classical time dependent Hartee (TDH) approximation tosimultaneously minimize or quench many identical groups in the forcefield of the protein.

Implementation of the MCSS algorithm requires a choice of functionalgroups and a molecular mechanics model for each of them. Groups must besimple enough to be easily characterized and manipulated (3–6 atoms, fewor no dihedral degrees of freedom), yet complex enough to approximatethe steric and electrostatic interactions that the functional groupwould have in substrate binding to the site of the AARS protein. Apreferred set is, for example, one in which most organic molecules canbe described as a collection of such groups (Patai's Guide to theChemistry of Functional Groups, ed. S. Patai (New York: John Wiley, andSons, (1989)). This includes fragments such as acetonitrile, methanol,acetate, methyl ammonium, dimethyl ether, methane, and acetaldehyde.

Determination of the local energy minima in the binding site requiresthat many starting positions be sampled. This can be achieved bydistributing, for example, 1,000–5,000 groups at random inside a spherecentered on the binding site; only the space not occupied by the proteinneeds to be considered. If the interaction energy of a particular groupat a certain location with the protein is more positive than a givencut-off (e.g. 5.0 kcal/mole) the group is discarded from that site.Given the set of starting positions, all the fragments are minimizedsimultaneously by use of the TDH approximation (Elber et al. (1990) J AmChem Soc 112: 9161–9175). In this method, the forces on each fragmentconsist of its internal forces and those due to the protein. Theessential element of this method is that the interactions between thefragments are omitted and the forces on the protein are normalized tothose due to a single fragment. In this way simultaneous minimization ordynamics of any number of functional groups in the field of a singleprotein can be performed.

Minimization is performed successively on subsets of, e.g. 100, of therandomly placed groups. After a certain number of step intervals, suchas 1,000 intervals, the results can be examined to eliminate groupsconverging to the same minimum. This process is repeated untilminimization is complete (e.g. RMS gradient of 0.01 kcal/mole/Å). Thusthe resulting energy minimized set of molecules comprises what amountsto a set of disconnected fragments in three dimensions representingpotential sidechains for amino acid analogs.

The next step then is to connect the pieces with spacers assembled fromsmall chemical entities (atoms, chains, or ring moieties) to form aminoacid analogs, e.g., each of the disconnected can be linked in space togenerate a single molecule using such computer programs as, for example,NEWLEAD (Tschinke et al. (1993) J Med Chem 36: 3863, 3870). Theprocedure adopted by NEWLEAD executes the following sequence of commands(1) connect two isolated moieties, (2) retain the intermediate solutionsfor further processing, (3) repeat the above steps for each of theintermediate solutions until no disconnected units are found, and (4)output the final solutions, each of which is single molecule. Such aprogram can use for example, three types of spacers: library spacers,single-atom spacers, and fuse-ring spacers. The library spacers areoptimized structures of small molecules such as ethylene, benzene andmethylamide. The output produced by programs such as NEWLEAD consist ofa set of molecules containing the original fragments now connected byspacers. The atoms belonging to the input fragments maintain theiroriginal orientations in space. The molecules are chemically plausiblebecause of the simple makeup of the spacers and functional groups, andenergetically acceptable because of the rejection of solutions withvan-der Waals radii violations.

In addition, the order in which the steps of the present method areperformed is purely illustrative in nature. In fact, the steps can beperformed in any order or in parallel, unless otherwise indicated by thepresent disclosure.

Furthermore, the method of the present invention may be performed ineither hardware, software, or any combination thereof, as those termsare currently known in the art. In particular, the present method may becarried out by software, firmware, or microcode operating on a computeror computers of any type. Additionally, software embodying the presentinvention may comprise computer instructions in any form (e.g., sourcecode, object code, interpreted code, etc.) stored in anycomputer-readable medium (e.g., ROM, RAM, magnetic media, punched tapeor card, compact disc (CD) in any form, DVD, etc.). Furthermore, suchsoftware may also be in the form of a computer data signal embodied in acarrier wave, such as that found within the well-known Web pagestransferred among devices connected to the Internet. Accordingly, thepresent invention is not limited to any particular platform, unlessspecifically stated otherwise in the present disclosure.

Examplery computer hardware means suitable for carrying out theinvention can be a Silicon Graphics Power Challenge server with 10R10000 processors running in parallel. Suitable software developmentenvironment includes CERIUS2 by Biosym/Molecular Simulations (San Diego,Calif.), or other equivalents.

While particular embodiments of the present invention have been shownand described, it will be apparent to those skilled in the art thatchanges and modifications may be made without departing from thisinvention in its broader aspect and, therefore, the appended claims areto encompass within their scope all such changes and modifications asfall within the true spirit of this invention.

I. Exemplary Uses

To date, over 100 noncoded amino acids (all ribosomally acceptable) havebeen reportedly introduced into proteins using other methods (see, forexample, Schultz et al., J. Am. Chem. Soc., 103: 1563–1567, 1981;Hinsberg et al., J. Am. Chem. Soc., 104: 766–773, 1982; Pollack et al.,Science, 242: 1038–1040, 1988; Nowak et al., Science, 268: 439–442,1995) all these analogs may be used for designing suitable AARSs thatcan efficiently incorporate these analogs into protein products. Ingeneral, the re-designed AARSs (mutant AARSs) of the instant inventioncan be used to incorporate amino acid analogs into protein productseither in vitro or in vivo.

The ration of natural/analog amino acids may be controlled so that thedegree of incorporation may be adjusted. In a preferred embodiment, 100%of one or more of the natural amio acids (such as Phe) may be depletedso that only its analogs are incorporated.

In another preferred embodiment, two or more analogs may be used in thesame in vitro or in vivo translation system. These analogs can beanalogs of the same amino acid (for example, different Phe analogs), ordifferent amino acids (for example, analogs of Phe and Tyr,respectively).

For in vitro use, one or more AARSs of the instant invention can berecombinantly produced and supplied to any the available in vitrotranslation systems (such as the commercially available Wheat GermLysate-based PROTEINscript-PRO™, Ambion's E. coli system for coupled invitro transcription/translation; or the rabbit reticulocyte lysate-basedRetic Lysate IVT™ Kit from Ambion). Optionally, the in vitro translationsystem can be selectively depeleted of one or more natural AARSs (by,for example, immunodepletion using immobilized antibodies againstnatural AARS) and/or natural amino acids so that enhanced incorporationof the analog can be achieved. Alternatively, nucleic acids encoding there-designed AARSs may be supplied in place of recombinantly producedAARSs. The in vitro translation system is also supplied with the analogsto be incorporated into mature protein products.

Although in vitro protein synthesis usually cannot be carried out on thesame scale as in vivo synthesis, in vitro methods can yield hundreds ofmicrograms of purified protein containing amino acid analogs. Suchproteins have been produced in quantities sufficient for theircharacterization using circular dichroism (CD), nuclear magneticresonance (NMR) spectrometry, and X-ray crystallography. Thismethodology can also be used to investigate the role of hydrophobicity,packing, side chain entropy and hydrogen bonding in determining proteinstability and folding. It can also be used to probe catalytic mechanism,signal transduction and electron transfer in proteins. In addition, theproperties of proteins can be modified using this methodology. Forexample, photocaged proteins can be generated that can be activated byphotolysis, and novel chemical handles have been introduced intoproteins for the site specific incorporation of optical and otherspectroscopic probes.

The development of a general approach for the incorporation of aminoacid analogs into proteins in vivo, directly from the growth media,would greatly enhance the power of unnatural amino acid mutagenesis. Forexample, the ability to synthesize large quantities of proteinscontaining heavy atoms would facilitate protein structure determination,and the ability to site-specifically substitute fluorophores orphotocleavable groups into proteins in living cells would providepowerful tools for studying protein function in vivo. Alternatively, onemight be able to enhance the properties of proteins by providingbuilding blocks with new functional groups, such as a keto-containingamino acid.

For in vivo use, one or more AARS of the instant invention can besupplied to a host cell (prokaryotic or eukaryotic) as geneticmaterials, such as coding sequences on plasmids or viral vectors, whichmay optionally integrate into the host genome and constitutively orinducibly express the re-designed AARSs. A heterologous or endogenousprotein of interest can be expressed in such a host cell, at thepresence of supplied amino acid analogs. The protein products can thenbe purified using any art-recognized protein purification techniques, ortechniques specially designed for the protein of interest.

The above described uses are merely a few possible means for generatinga transcript which encodes a polypeptide. In general, any means known inthe art for generating transcripts can be employed to synthesizeproteins with amino acid analogs. For example, any in vitrotranscription system or coupled transcription/translation systems can beused for generate a transcript of interest, which then serves as atemplate for protein synthesis. Alternatively, any cell, engineeredcell/cell line, or functional components (lysates, membrane fractions,etc.) that is capable of expressing proteins from genetic materials canbe used to generate a transcript. These means for generating atranscript will typically include such components as RNA polymerase (T7,SP6, etc.) and co-factors, nucleotides (ATP, CTP, GTP, UTP), necessarytranscription factors, and appropriate buffer conditions, as well as atleast one suitable DNA template, but other components may also added foroptimized reaction condition. A skilled artisan would readily envisionother embodiments similar to those described herein.

IV. EXAMPLES

We have developed computational methods to predict sequence changes inaminoacyl tRNA synthetases that enable them to bind to novel amino acidanalogs. A rotamer library for the artificial amino acid is built byvarying its torsional angles to create rotamers that would fit in thebinding pocket for the natural substrate. The geometric orientation ofthe backbone of the amino acid analog is specified by thecrystallographic orientation of the backbone of the natural substrate inthe crystal structure. Amino acids in the binding pocket of thesynthetase that interact with the side chain on the analog are allowedto vary in identity and rotameric conformation in the subsequent proteindesign calculations.

We have also developed a computational method to enhance theinteractions between the substrate and the protein positions. This isdone by scaling up the pair-wise energies between the substrate and theamino acids allowed at the design positions on the protein in the energycalculations. In an optimization calculation where the protein-substrateinteractions are scaled up compared to the intra-protein interactions,sequence selection is biased toward selecting amino acids to be thosethat have favorable interaction with the substrate.

The details of the invention are described below by means of examples ofincreasing detail and specificity. These examples are provided only toillustrate preferred embodiments of the invention. However, theinvention is not limited to the particular embodiments, and manymodifications and variations of the invention will be apparent to thoseskilled in the art. Such modifications and variations are therefore alsoconsidered part of the invention.

In the following example, we have used Phenyalanyl tRNA synthetase(PheRS) as model to test the techniques. We have designed a new mutantform of E. Coli tRNA synthetase that allows efficient in vivoincorporation of aryl ketone functionality in recombinant proteins.

I. A Designed Phenylalanyl-tRNA Synthetase Variant Allows Efficient InVivo Incorporation of Non-natural Amino Acids (e.g. with Aryl KetoneFunctionality) into Proteins

We have previously exploited the ability of auxotrophic Escherichia colistrains to effect efficient incorporation of amino acid analogues intoproteins in a multi-site fashion. The method is simple and produces highprotein yields, and incorporation of the analogue at multiple sitesoffers significant advantages with respect to control of proteinproperties such as thermal and chemical stability.²

In this study, we report a computationally-designed variant of the E.coli phenylalanyl-tRNA synthetase (ePheRS), which allows efficient invivo incorporation of aryl ketone functionality into proteins. In 1991,Kast and coworkers³ introduced a variant of ePheRS (termed ePheRS*),which bears an Ala294Gly (A294G) mutation in the α-subunit and whichthereby acquires relaxed substrate specificity. We have recently shownthat over-expression of ePheRS* can be exploited to effect efficientincorporation of p-bromo-, p-iodo-, p-ethynyl-, p-cyano- andp-azidophenylalanines into recombinant proteins in E. coli hosts.^(4,5)But similar experiments with p-acetylphenylalanine (2 in FIG. 1) failed;even in a host in which ePheRS* was over-expressed, Phe-depletedcultures supplemented with 2 did not produce substantial yields ofprotein (FIG. 2).

Our interest in 2 arises from the chemical versatility of the side-chainketone function, which can be chemoselectively ligated with hydrazide,hydroxylamino, and thiosemicarbazide reagents under physiologicalconditions.^(6,7) Cornish and coworkers have accomplished site-specificincorporation of ketone functionality into recombinant proteins via invitro translation;⁸ however, we are unaware of previous reports of invivo methods of introducing ketone functionality into recombinantproteins.

It is desirable to identify/engineer additional ePheRS mutants thatwould allow efficient incorporation of 2 into recombinant proteins invivo. The technique can also be used to introduce other unnaturalsubstrates/substrate analogs that can not be efficiently incorporatedinto recombinant proteins in vivo using natural RNA synthetases orexisting mutant RNA synthetases with relaxed substrate specificity.

The crystal structure of Thermus thermophilus PheRS (tPheRS) complexedwith 1 is available⁹ and there is 43% overall sequence identity betweenePheRS and /PheRS; sequence identity in the identified active siteregion is 80%. We therefore employed a previously described proteindesign algorithm¹⁰ to identify potentially useful mutants of tPheRS,with the intention to prepare and evaluate the corresponding mutantforms of ePheRS. We generated a backbone-independent rotamer library for2, in which both the χ1 and χ2 torsional angles were varied by ±20° (inincrements of 5°) from the values of 1 in the tPheRS structure. Designcalculations were performed by fixing the identity of the substrate (2)and by allowing each of 11 non-anchor sites in the amino acid bindingpocket of tPheRS (determined from the crystal structure) to be occupiedby any of the twenty natural amino acids except for proline, methionineand cysteine. The anchor residues (Glu128, Glu130, Trp149, His178,Ser180, Gln183, and Arg204) were held fixed in identity and conformationin all calculations. These residues contribute important electrostaticinteractions with the substrate and it is reasonable to assume that suchinteractions are equally critical for the binding of 2.

The calculations identified two important cavity-forming mutations:Val261 (Thr251 in E. coli) to Gly, and Ala314 (Ala294 in E. coli) toGly. These predictions are consistent with the results of Reshetnikovaand coworkers,⁹ who pointed out that Ala314 and Val261 hinder thebinding of amino acids larger than phe (e.g., tyrosine) into active siteof tPheRS. Further confidence in the prediction was engendered by thefact that the Ala294Gly mutant allows incorporation of an interestingset of para-substituted phenylalanines, as described earlier.^(4,5) Wewere thus encouraged to test whether the additional Thr251Gly mutationwould relax the specificity of ePheRS* sufficiently to allowincorporation of 2 into proteins in vivo.

E coli pheS* (which encodes ePheRS*) was amplified by the polymerasechain reaction (PCR) from vector pQE-FS.⁴ Amplified pheS* was subjectedto PCR mutagenesis to create the coding sequence for the desiredThr251Gly mutant, which we designate pheS**. To allow constitutiveexpression of the synthetase, a tac promotor with an abolished lacrepressor binding site was inserted upstream of the start codon ofpheS**.¹¹ The expression cassette was then cloned into pQE15 (Qiagen),which encodes the marker protein mouse dihydrofolate reductase (mDHFR).The resulting plasmid was designated pQE-FS**. As a control, plasmidpQE-FS* (which contained pheS* under control of a constitutive tacpromoter) was constructed similarly. AF-IQ, a phenylalanine auxotrophicE. coli strain carrying the repressor plasmid pLysS-IQ,⁴ was transformedwith pQE15, pQE-FS*, or pQE-FS** to generate expression systemsAF-IQ[pQE15], AF-IQ[pQE-FS*], and AF-IQ[pQE-FS**], respectively. Thecapacity of 2 to support protein synthesis in each expression system wasdetermined by induction of mDHFR expression in phenylalanine-freeminimal media supplemented with 2. As shown in SDS-PAGE analysis ofwhole cell lysates (FIG. 2), neither AF-IQ[pQE15] nor AF-IQ[pQE-FS*]exhibits protein expression above background (-phe) in mediasupplemented with 2.12 We note that Phe starvation does not completelyeliminate background expression, presumably because of incompletedepletion of the cellular pool of phe. In contrast, similarlysupplemented cultures of AF-IQ[pQE-FS**] yield high levels of mDHFRexpression. The histidine-tagged protein from the latter culture(mDHFR-2) was purified in a yield of about 20 mg/L, approximately 60% ofthat obtained from cultures supplemented with 1. MALDI-TOF massspectrometry showed that the mass of mDHFR-2 was increased by 304 Da,which corresponds to approximately 80% replacement of 1 by 2 (mDHFRcontains 9 Phe residues). Incorporation of 2 was confirmed by trypticdigestion of mDHFR-2 (FIG. 3). For mDHFR, two peptides (Peptides A andB) in the mass range 1550–1750 Da were assigned to residues 34–47 and94–106, respectively (FIG. 3 a). Each fragment contains a single Pheresidue. The corresponding fragments of mDHFR-2 (FIG. 3 b) were shiftedup in mass by 42 Da, consistent with the increased mass of 2 relative to1.¹³

We have completed preliminary studies of the reactivity of mDHFR-2toward hydrazide reagents. Purified mDHFR and mDHFR-2 were dissolved inPBS buffer (pH 6.0) and treated either with 5 mM biotin hydrazide (BH)or with PBS buffer as a negative control. The reaction products wereanalyzed by western blotting and visualized by treatment with abiotin-specific streptavidin-HRP conjugate (FIG. 4). The products werealso examined for the presence of the 6×His tag of mDHFR to ensure theidentity of the protein band and to probe the possibility of chaincleavage under the ligation conditions. The results are consistent withchemoselective ligation without chain cleavage (FIG. 4 b).

In conclusion, we describe here a new mutant form of the E. coliphenylalanyl-tRNA synthetase, which was designed based on the knownstructure of the related Thermus thermophilus PheRS (tPheRS). The newmutant form of the E. coli phenylalanyl-tRNA synthetase (ePheRS**)allows efficient in vivo incorporation of reactive aryl ketonefunctionality into recombinant proteins. This study also demonstratesthe power of computational protein design in the development ofaminoacyl-tRNA synthetases for activation and charging of non-naturalamino acids.

Supporting Information

Plasmid Construction

E. coli pheS* was amplified by the polymerase chain reaction (PCR) fromvector pQE-FS⁴. Amplified pheS* was subjected to PCR mutagenesis tocreate the coding sequence for the desired Thr251Gly mutant, which wedesignatepheS**. To allow constitutive expression of the synthetase, alinker encoding a tac promoter with an abolished lac repressor bindingsite was prepared with terminal NheI restriction sites and internal NcoIand HiridIII sites. The linker sequence is:

5′CTAGCAGTTGACAA TTAATCATCGGCTCGTATAATG GATCGAATTGTGAGCGGAATCGATTTTCACACAGGAAACAGACC

GATCTTCGTCGCCATCCTCGGGTCGACGTCTGTTGCAAGCTTG-3′ (SEQ ID NO: 3, the -35and -10 sequences are underlined and the start codon is in bold).

This linker was cloned into the NheI site of vector pET5a (Novagen) toform pET5a-tac. PCR amplified fragments containing pheS* and pheS** werecloned into pET5a-tac at the NcoI and HindIII sites. pheS* and pheS**outfitted with the tac promoter were cut out as NheI fragments andinserted into expression plasmid pQE15 (Qiagen) to yield pQE-FS* andpQE-FS**, respectively. Expression plasmids pQE15, pQE-FS* and pQE-FS**encode the marker protein murine dihydrofolate reductase (mDHFR) undercontrol of a bacteriophage T5 promoter.

Determination of Translational Activity

Buffer and media were prepared according to standard protocols. Aphenylalanine auxotrophic derivative of E. coli strain BL21(DE3),designed AF (HsdS gal (χcIts857 ind I Dam5 lacUV5-T7 gene 1) pheA) andconstructed in our laboratory, was used as the expression host. The AFstrain was transformed with repressor plasmid pLysS-IQ and with pQE15,pQE-FS* or pQE-FS** to generate expression strains AF-IQ[pQE15],AF-IQ[pQE-FS*] or AF-IQ[pQE-FS**], respectively.

Small scale (10 ml) cultures were used to investigate the in vivotranslational activity of 2 (p-AcetylPhe). M9 minimal medium (50 ml)supplemented with 0.2% glucose, 1 mg/L thiamine, 1 mM MgSO₄, 1 mM CaCl₂,19 amino acids (at 20 mg/L), antibiotics (ampicillin 200 mg/L,chloramphenicol 35 mg/L) and phenylalanine (at 20 mg/L) was inoculatedwith 1 ml of an overnight culture of the expression strain. When theoptical density at 600 nm reached 0.8–1.0, a medium shift was performed.Cells were sedimented by centrifugation for 15 min at 3,100 g at 4° C.,the supernatant was removed and the cell pellets were washed twice with0.9% NaCl. Cells were resuspended in supplemented M9 medium containingeither: (a) 250 mg/L of p-AcetylPhe 2, (b) 20 mg/L phe (1) (positivecontrol), or (c) no phe or analog (negative control). Protein expressionwas induced 10 min after the medium shift by addition ofisopropyl-β-D-thiogalactoside (IPTG) to a final concentration of 1 mM.Cells were cultured for 4 hours post-induction and protein expressionwas monitored by SDS polyacrylamide gel electrophoresis (PAGE, 12%),using a normalized OD₆₀₀ of 0.2 per sample.

Protein Purification

mDHFR as expressed in this work contains an N-terminal hexahistidine(6×His) sequence, which was utilized to purify the protein by nickelaffinity chromatography with stepwise pH gradient elution underdenaturing conditions according to the recommendations of the supplier(Qiagen). The eluted protein was buffer-exchanged (Millipore, MWCO=5kDa) against distilled water three times and the purified protein wassubjected to matrix-assisted laser desorption ionization massspectrometry (MALDI-MS) analysis.

Tryptic Peptide Analysis

10 μl of purified protein in elution buffer (8M urea, 100 mM NaH₂PO₄, 10mM Tris, pH 4.5) was mixed with 90 μl 75 mM NH₄OAc, to which 2 μL ofmodified trypsin (Promega, 0.2 μg/μL) was added. The solution wasallowed to digest overnight at room temperature. The reaction wasquenched by addition of trifluoroacetic acid to pH<4.0. The digest wassubjected to sample clean-up by using ZipTip_(c18), which provided 2 μlof purified sample solution. 10 μl of the MALDI matric(α-cyano-β-hydroxycinnamic acid, 10 mg/ml in 50% CH₃CN) was added, and0.5 μl of the resulting solution was spotted directly onto the sampleplate. Samples were analyzed in the linear mode on an Applied BiosystemsVoyager DE Pro MALDI-TOF mass spectrometer.

Chemical Modification of Protein

Purified proteins (mDHFR-wt and mDHFR-2) were dissolved in 200 μl of PBSbuffer (pH 6.0) and added to 20 μl of 5 mM biotin hydrazide (BH,dissolved in PBS). Protein/BH mixtures were incubated at roomtemperature for 1 to 1.5 hr. Reaction solutions were then washed twicewith distilled water using a buffer-exchange column (Millipore, MWCO=5kDa). Standard Western blotting procedures were used to identifyproteins modified with BH as well as those bearing an N-terminalhexahistidine tag.

II. Design Calculations for Redesigning Phe Binding Pocket in Phe tRNASynthetase for Binding to DPA, A Phenylalanine Analog

To create mutant AARSs that are able to bind to unnatural amino acids,and eventually amino acylate 3′ ends of tRNAs, a three-step approach canbe adopted:

-   -   1) Design the binding site of natural AARSs by using a protein        design algorithm to generate mutants that will bind to        non-canonical amino acids;    -   2) Calculate simulated binding energies and check for        specificity of the designed mutants using a heir-DOCK protocol;    -   3) Experimentally test for the incorporation of the artificial        amino acid using an in vivo/in vitro system.        Methodology and Rationale

Model System

Phenylaanine-tRNA synthetase (PheRS): We have selected the PheRS as ourmodel for this study. PheRS is an (αβ)₂ enzyme with 350 amino acids inthe a subunit and 785 in the β subunit (Stepanov et al., 1992;Reshetnikova et al., 1999). The binding site for phenylalanine (Phe) isin the α subunit (FIG. 7). There are a number of reasons for selectingthis system. The crystal structure of PheRS bound to its naturalsubstrate, Phe, is readily available. An accurate description of thebinding pocket is critical for the computational design approach sinceit depends on the crystal structure for the protein backbonedescriptions, although in many cases it is perfectly acceptable to usecrystal structure of a homologous protein (for example, a homolog from arelated species) or even a conserved domain to substitute the bindingpocket. The crystal structure also defines the orientation of Phe in thebinding pocket of the synthetase. The aromatic ring of Phe is buried inthe hydrophobic pocket while the carbonyl and the amide groups of thebackbone make extensive electrostatic contacts with the charged andpolar residues at the mouth of the pocket. We would like to design thebinding pocket for the analogs so that these analogs bind to PheRS inthe same orientation as Phe, since this orientation may be important forthe adenylation step.

Another reason for using this system is that the structure of PheRS doesnot undergo any significant structural rearrangement on substratebinding as indicated by the crystal structures of the free PheRS and thePheRS-Phe complex. This makes the system well suited for our proteindesign algorithm which uses a fixed-backbone structure for side-chainselection. PheRS is an extensively studied system for the assimilationof artificial amino acids and has been used successfully inincorporation of a few Phe analogs. In fact, limited mutation analysishas also been done on this system to alter substrate specificity (Kastand Hennecke, 1991). Moreover, molecular dynamics (MD) simulations havebeen performed in an effort to understand the binding behavior of PheRStowards various analogs.

Analogs: The Phe analogs selected for the study are shown in FIG. 8.These analogs have been experimentally tested to check if they arereadily incorporated in proteins by the natural PheRS. MD simulationswere performed using these analogs and significant correlation wasachieved between the predicted binding energies and the in vivoincorporation rates exhibited by these analogs (Wang et al., 2001).

Binding Site Design

We used a protein design algorithm (Dahiyat and Mayo 1996; Dahiyat etal., 1997), ORBIT, to predict the optimal amino acid sequences of thebinding pocket for binding to the different analogs. Selection of aminoacids is performed using a very efficient search algorithm that relieson a discrete set of allowed conformations for each side chain andempirical potential energy functions that are used to calculatepair-wise interactions between side chains and the between the sidechains and backbone.

Surveys of protein structure database have shown that side chainsexhibit marked conformational preferences, and that most side chains arelimited to a small number of torsional angles. Thus, the torsionalflexibility of most amino acids can be represented with a discrete setof allowed conformations called rotamers. Rotameric preferences in sidechains are observed that depend on the main-chain conformations. ORBITaccounts for the torsional flexibilities of side chains by providingrotamer libraries that are based on those developed by Dunbrack andKarplus (Dunbrack and Karplus 1993; Dunbrack and Karplus 1994).

In our design, we would like to optimize the binding pocket for bindingto Phe analogs. In performing the optimization calculations, we wouldlike to vary the torsional angles of analogs and side chains lining thepocket simultaneously. This requires generating rotamer libraries forthe analogs, since they are not included in our standard rotamerlibraries. For all the natural amino acids, the possible χ1 and χ2angles are derived from database analysis. Since this is not feasible inthe case of artificial amino acids, the closest approximation for χ1 andχ2 angles for Phe analogs are taken to be the same as those for Phe.Moreover, we would like the analogs to have a conformation as close aspossible to the orientation of Phe in the binding pocket. So allowingsimilar torsional angles for the analogs as of Phe seems logical.

Since the residues in the pocket are buried in the protein structure, wehave force field parameters similar to those used in previous proteincore design calculations. The design algorithm uses energy terms basedon a force field that includes van der Waals interactions, electrostaticinteractions, hydrogen bonding, and solvation effects (Gordon et al.,1999).

Backbone independent rotamer libraries for all the analogs shown in FIG.8 are generated. Both the χ1 and χ2 torsional angles were varied tomatch those of Phe rotamers in our standard backbone independent rotamerlibrary. The torsional angles of Phe in the crystal structure (χ1:−101°,χ2:−104°) were also included in the new rotamer libraries for both Pheand the analogs. Charges were assigned only to the heavy atoms of theanalogs to be consistent with the way charges for the natural aminoacids are represented in ORBIT. The first analog selected for design wasDPA.

Design calculations were run by fixing the identity of the substrate tobe DPA and varying 11 other positions on PheRS (137, 184, 187, 222, 258,260, 261, 286, 290, 294, and 314). Positions 137, 184, 258, 260, 261,286, 290, 294, 314 were allowed to be any of the 20 natural amino acidsexcept proline (Pro or P), methionine (Met or M) and cysteine (Cys orC). Methionine was allowed at position 187 because its wild-typeidentity is Met, and only hydorphobic amino acids were allowed atposition 222. Most of these positions are buried in the core and anumber of them pack against Phe in the crystal structure. Mutationanalysis at position 294 has been shown to alter substrate specificity.The anchor residues (Glu 128, Glue 130, Trp 149, His 178, Ser 180, Gln183, Arg 204) were held fixed both in identity and conformation in allthe calculations. These make very important electrostatic interactionswith the substrate and this interaction is probably equally critical forthe analogs. From the crystal structure it appears that the anchorresidues hold the Phe zwitterion in a way that the carbonyl group of thezwitterion is close to the ATP binding site. This proximity may beimportant for the aminoadenylation reaction. The aminoadenylation stepis required for the incorporation of all the amino acids and hence, itseems important to make sure that the carbonyl and the amide groups ofthe analog zwitterions are also anchored the same way as the naturalsubstrate at this site.

In the first design attempt we allowed all the DPA rotamers that weregenerated in the rotamer library. The DPA rotamer selected in thestructure generated was not buried in the binding pocket and most of itis solvent-exposed. A second calculation was run allowing only those DPArotamers that would pack into the binding pocket. These are the rotamerswith all possible combinations of χ1 of −101 (±20°) and χ2 of −104(±20°) in increments of 5°. The structure generated in this calculationhas the aromatic ring of DPA buried in the pocket and is almostcompletely superimposable with the Phe in the PheRS crystal structure.

RBIAS—for Enhancing Protein-substrate Interactions

The sequence generated in the first calculation has two important cavityforming mutations which are Val261 to Glycine and Ala314 to Glycine.Mutation to Glycine at position 314 has been previously reported to haveenhanced the incorporation of larger Phe analogs that are notincorporated by the wild type synthetase. All the mutations predicted inthis calculation gear towards making enough space in the binding pocketfor accommodating DPA, but besides van der Waals interactions, we do notsee any specific interaction between DPA and the synthetase.

In an attempt to design specificity into the protein-substrateinteraction, we developed a program, RBIAS, which enhances theinteractions between the substrate and the protein positions. This isdone by scaling up the pair-wise energies between the substrate and theamino acids allowed at the design positions on the protein in the energycalculations. In an optimization calculation where the protein-substrateinteractions are scaled up compared to the intra-protein interactions,sequence selection will be biased toward selecting amino acids to bethose that have favorable interaction with the substrate.

We performed multiple calculations by scaling up the substrate-proteinby factors of 2.0 to 20.0, in increments of 2.0. A scale factor of 4.0generated an interesting mutation, Val 286 to Gin, which makes ahydrogen bond with acetyl group at the distal end of DPA. Theinteraction between DPA and the PheRS in this sequence was enhanced by2.12 kcal/mol although the complex is destabilized by 12.96 kcal/mol asindicated by the total energy. A bias scale factor of 18.0 generated anew mutation, Val 290 to Lysine. We believe this mutation is notimportant for specificity since lysine at this position is not makingsignificant interactions with DPA. Moreover, polar groups in the core,especially those that are not involved in a salt-bridge or a hydrogenbond may significantly destabilize proteins. Therefore, we can trade offonly some amount of overall protein stability in order to gainspecificity between the protein and the substrate. Based on theseconsiderations, the following sequence mutations or combinations thereofare good candidates for further testing:

Sequences Suggested for Further Testing

Mutations Double Glycine V261G, A314G Double Gly + Gln V261G, A314G,V286Q Double Gly + Gln 286 + Leu 290 V261G, A314G, V286Q, V290L

In all calculations, position 258 is mutated from Phe to Tyr. Thisposition is slightly exposed and is not in direct contact with DPA.Position 258 is very close to the anchor residues so mutating it mayaffect the transfer of the amino acid to its RNA. We recommend notincluding this mutation, We also recommend ignoring mutations V184I(this position is far enough from the substrate binding site andtherefore, may not affect binding) and L222A (this mutation is predictedbecause of a potential clash between Methionine rotamer at 187 andLeucine at 184 in the calculation).

Table for RBIAS calculations: A big energy clash between DPA and thebinding pocket in the wild type sequence (WT- indicates why DPA is notincorporated by the wild type synthetase Sequence 128 130 137 149 178180 183 184 187 204 222 258 WT_Phe E E L W H S Q V M R L F No bias_phe EE L W H S Q I M R A Y WT_DPA E E L W H S Q V M R L F No bias_dpa E E L WH S Q I M R L Y Bias 2.0 E E L W H S Q I M R L Y Bias 4.0 E E L W H S QI M R A Y Bias 6.0 E E L W H S Q I M R A Y Bias 8.0 E E L W H S Q I M RA Y Bias 10.0 E E L W H S Q I M R A Y Bias 12.0 E E L W H S Q I M R A YBias 14.0 E E L W H S Q I M R L Y Bias 16.0 E E L W H S Q I M R L Y Bias18.0 E E L W H S Q I M R A Y Bias 20.0 E E L W H S Q I M R A Y LigandTotal Sequence 260 261 286 290 294 314 Energy Energy WT_Phe F V V V V A−16.91 −240.71 No bias_phe F V L I I A −16.87 −242.1469 WT_DPA F V V V VA 1033145 51669.81 No bias_dpa F G L I V −21.40 −225.13 Bias 2.0 F G L IV G −21.40 −225.13 Bias 4.0 F G Q L V G −23.52 −212.17 Bias 6.0 F G Q LV G −23.52 −212.17 Bias 8.0 F G Q L V G −23.52 −212.17 Bias 10.0 F G Q LV G −23.52 −212.17 Bias 12.0 F G Q L V G −23.52 −212.17 Bias 14.0 F G QL V G −23.63 −209.34 Bias 16.0 F G Q L V G −23.63 −209.34 Bias 18.0 F GQ K V G −23.96 −198.83 Bias 20.0 F G Q K V G −23.96 −198.83

REFERENCES

-   1. (a) Dougherty, D. A. Curr. Opin. Chem Biol. 2000, 4, 645–652 (b)    Minks, C. Alefelder, S. Moroder, L., Huber, R., Budisa, N.    Tetrahedron 2000, 56, 9431–9442. (c) Behrens, C., Nielsen, J. N.,    Fan, X. J., Doisy, X., Kim, K. H., Praetorius-Ibba, M., Nielsen, P.    E., Ibba, M. Tetrahedron 2000, 56 9443–9449.-   2. Tang, Y; Giovanna, G.,; Petka, W. A.; Nakajima, T.; Degrado, W.    F.; Tirrell, D. A. Agnew. Chem. Int. Ed Engl. 2001, 40, 1494–1496.-   3. (a) Ibba, M.; Hennecke, H. FEBS Lett. 1995, 364, 272–275. (b)    Ibba, M.; Kast, P.; Hennecke, H. Biochemistry 1994, 33,    7107–7112. (c) Kast, P.; Hennecke, H. J. Mol. Biol 1991, 222,    99–124.-   4. Sharma, N.; Furter, R.; Tirrell, D. A. FEBS Lett 2000, 467,    37–40.-   5. Kirshenbaum, K; Carrico, I. S.; Tirrell, D. A. Chem. BioChem.    2001, I in press.-   6. (a) Rose, K. J Am. Chem. Soc. 1994, 116, 30–33. (b) Canne, L. E.;    FerreDAmare, S. K.; Burley, S. B. H. J. Am. Chem Soc. 1995, 117,    2998–3007. (c) Rideout, D.; Calogeropoulou, T. Biopolymers 1990, 29,    247–262. (d) Rideout, D. Cancer Invest. 1994, 12, 189–202. (e) Shao,    J.; Tam, J. P. J Am Chem Soc 1995, 117, 3893–3899.-   7. (a) Mahal, L. K.; Yarema, K. J.; Bertozzi, C. R. Science 1997,    276, 1125–1128. (b) Yarema, K. J.; Mahal, L. K.; Bruehl, R. E.;    Rodriguez, E. C.; Bertozzi, C. R. J. Biol. Chem 1998 273,    31168–31179.-   8. Cornish, V. W.; Hahn, K. M.; Schultz, P. G. J. Am. Chem. Soc    1996, 118, 8150–8151.-   9. (a) Reshetnikova, L.; Moor, N.; Lavrik, O.; Vassylyev, D. G. J.    Mol. Bio. 1999, 287, 555–568. (b) Fishman R.; Ankilova, V.; Moor,    N.; Safro, M. Acta Cryst. 2001, D57, 1534–1544.-   10. Dayiyat, B. I.; Mayo, S. G. Science 1997, 278, 82–87.-   11. Furter, R. Protein Sci 1998, 7, 419–426.-   12. Phe starvation does not completely eliminate background    expression, presumably because of incomplete depletion of the    cellular pool of phe.-   13. The enhanced promiscuity of ePheRS** allows misincorporation of    tryptophan under phe starvation conditions in the absence of 2. In    media supplemented with 2 there is no detectable misincorporation of    any of the canonical amino acids. We have not observed    misincorporation of tyrosine under any conditions.-   14. Budisa, N., Steipe, B., Demange, P. Eckerskorn, C., Kellermann,    J., and Huber, R. (1995). High-Level Biosynthetic Substitution of    Methionine in Proteins by its Analogs 2-Aminohexanoic Acid,    Selenomethionine, Telluromethionine, and Ethionine in E. coli.    Eur. J. Biochem 230: 788–796.-   15. Dahiyat, B. I. and Mayo, S. L. (1996). Protein Sci 5(5):    895–903.-   16. Dahiyat, B. I., Sarisky, C. A., Mayo, S. L. (1997). De novo    protein design: towards fully automated sequence selection. J Mol    Biol 273(4): 789–96.-   17. Deming, T. J., Fournier, M. J., Mason T. L., and Tirrell, D. A.    (1997). Biosynthetic Incorporation and Chemical Modification of    Alkene Functionality in Genetically Engineered Polymers. J.    Macromol. Sci. Pure Appl. Chem A34; 2143–2150.-   18. Duewel, H., Daub, E., Robinson, V., and Honek, J. F. (1997).    Incorporation of Trifluoromethionine into a Phage Lysozyme:    Implications and a New Marker for Use in Protein 19F NMR.    Biochemistry 36(3404–3416).-   19. Dunbrack, R. L. and Karplus, M. (1993). Backbone-dependent    rotamer library for proteins. Application to side-chain prediction.    J Mol Biol 230(2): 543–74.-   20. Dunbrack, R. L. and Karplus, M. (1994). Conformational analysis    of the backbone-dependent rotamer preferences of protein side    chains. Nat Struct Biol 1(5): 334–40.-   21. Ewing J. A., Kuntz, I. D. (1997). Critical evaluation of search    algorithms for automated molecular docking and database screening. J    Comp Chem 18: 1175–1189.-   22. Floriano, W. B., Vaidehi, N., Goddard, W. A., Singer, M. S.,    Shepherd, G. M. (2000). Molecular mechanisms underlying differential    odor responses of a mouse olfactory receptor. Proc Natl Acad Sci USA    97(20): 10712–6.-   23. Ghosh, G., Pelka, H., Schulman, L. H. (1990). Identification of    the tRNA anticodon recognition site of E. coli methionyl-tRNA    synthetase. Biochemistry 29(9): 2220–5.-   24. Gordon, D. B., S. A. Marshall, Mayo, S. L. (1999). Energy    functions for protein design. Curr Opin Struct Biol 9(4): 509–13.-   25. Kast, P., Hennecke, H. (1991). Amino acid substrate specificity    of Escherichia coli phenylalanyl-tRNA synthetase altered by distinct    mutations. J Mol Biol. 222(1): 99–124.-   26. Reshetnikova, L., Moor, N., Lavrik, O., Vassylyev, D. G. (1999).    Crystal structures of phenylalanyl-tRNA synthetase complexed with    phenylalanine and a phenylalanyl-adenylate analogue. J Mol Biol    287(3): 555–68.-   27. Sharma, N., Furter, R., Kast, P., Tirrell, D. A. (2000).    Efficient introduction of arylbromide functionality into proteins in    vivo. FEBS Lett 467(1): 37–40.-   28. Stepanov, V. G., Moor, N. A., Ankilova, V. N. Lavrik, O. I.    (1992). Phenylalanyl-tRNA synthetase from Thermus thermophilus can    attach two molecules of phenylalanine to tRNA(Phe). FEBS Lett    311(3): 192–4.-   29. van Hest, J. C. and Tirrell, D. A. (1998). Efficient    introduction of alkene functionality into proteins in vivo. FEBS    Lett 428(1–2): 68–70.-   30. Wang, P., Vaidehi, N. Tirrell, D. A., and Goddard III, W. A.    (2001). In preparation.

The practice of the present invention will employ, unless otherwiseindicated, conventional techniques of molecular biology, cell biology,cell culture, microbiology and recombinant DNA, which are within theskill of the art. Such techniques are explained fully in the literature.See, for example, Molecular Cloning: A Laboratory Manual, 2^(nd) Ed.,ed. By Sambrook, Fritsch and Maniatis (Cold Spring Harbor LaboratoryPress: 1989); DNA Cloning, Volumes I and II (D. N. Glover ed., 1985);Oligonucleotide Synthesis (M. J. Gait ed., 1984); Mullis et al.; U.S.Pat. No. 4,683,195; Nucleic Acid Hybridization (B. D. Hames & S. J.Higgins eds. 1984); Transcription And Translation (B. D. Hames & S. J.Higgins eds. 1984); B. Perbal, A Practical Guide To Molecular Cloning(1984); the treatise, Methods In Enzymology(Academic Press, Inc., N.Y.);Methods In Enzymology, Vols. 154 and 155 (Wu et al. eds.),Immunochemical Methods In Cell And Molecular Biology (Mayer and Walker,eds., Academic Press, London, 1987).

The contents of all cited references (including literature references,issued patents, published patent applications as cited throughout thisapplication) are hereby expressly incorporated by reference.

EQUIVALENTS

Those skilled in the art will recognize, or be able to ascertain usingno more than routine experimentation, numerous equivalents to thespecific method and reagents described herein, including alternatives,variants, additions, deletions, modifications and substitutions. Suchequivalents are considered to be within the scope of this invention andare covered by the following claims.

1. A method for designing amino acid sequence change(s) in an aminoacyltRNA synthetase (AARS), wherein said change could enable said AARS toincorporate an amino acid analog of a natural amino acid substrate ofsaid AARS into a protein in vivo, comprising: (a) providing athree-dimensional structure model of said AARS with said natural aminoacid substrate; (b) providing a rotamer library for said amino acidanalog; (c) identifying, based on said three-dimensional structuremodel, anchor residues of said AARS interacting with the backbone ofsaid natural amino acid substrate; (d) identifying, in the bindingpocket of said AARS, amino acid position(s) involved in interaction withthe side chain of said natural amino acid substrate as candidatevariable residue position(s); (e) providing an AARS-analog backbonestructure by substituting said natural amino acid substrate with saidamino acid analog in said three-dimensional structure model, wherein thegeometric orientation of the backbone of said amino acid analog isspecified by the orientation of the backbone of said natural amino acidsubstrate, and wherein the backbone of said amino acid analog and allanchor residues identified in (c) are fixed in identity and rotamericconformation in relation to said AARS-analog backbone structure; (f)providing a group of potential rotamers for each of the candidatevariable residue position(s) identified in (d), wherein the group ofpotential rotamers for at least one of said variable residue position(s)has a rotamer selected from each of at least two different amino acidside chains; and (g) analyzing the interaction of each of the rotamersfor said amino acid analog and rotamers for said variable residueposition(s) with all or part of the remainder of said AARS-analogbackbone structure to generate a set of optimized protein sequences;wherein steps (b)–(d) are carried out in any order, and wherein said setof optimized protein sequences contain said amino acid sequencechange(s) in the AARS, which change could enable the AARS to incorporatethe amino acid analog into the protein in vivo.
 2. The method of claim1, further comprising: (h) identifying additional protein sequencechanges that favor interaction between said AARS and said amino acidanalog, by repeating step (g) while scaling up interactions between saidamino acid analog and said variable residue position(s).
 3. The methodof claim 1, further comprising testing said AARS in vivo for itsability, specificity, and/or efficiency for incorporating said aminoacid analog into a protein.
 4. The method of claim 2, further comprisingtesting said AARS in vivo for its ability, specificity, and/orefficiency for incorporating said amino acid analog into a protein. 5.The method of claim 1, wherein said three-dimensional structure model isa known crystallographic or NMR structure.
 6. The method of claim 1,wherein said three-dimensional structure model is provided by homologymodeling based on a known crystallographic or NMR structure of a homologof said AARS.
 7. The method of claim 6, wherein said homolog is at leastabout 70% identical to said AARS in the active site region.
 8. Themethod of claim 1, wherein the rotamer library for said amino acidanalog is a backbone-independent rotamer library.
 9. The method of claim1, wherein the rotamer library for said amino acid analog is a rotamerlibrary provided by varying both the χ1 and χ2 torsional angles by ±20degrees, in increment of 5 degrees, from the values of said naturalamino acid substrate in said AARS structure.
 10. The method of claim 1,wherein said AARS is Phe tRNA Synthetase (PheRS).
 11. The method ofclaim 10, wherein said PheRS is from E. coli.
 12. The method of claim 1,wherein said AARS is from a eukaryote.
 13. The method of claim 12,wherein said eukaryote is human.
 14. The method of claim 1, wherein saidamino acid analog is a derivative of at least one of the 20 naturalamino acids, with one or more functional groups not present in naturalamino acids.
 15. The method of claim 14, wherein said functional groupis selected from the group consisting of: bromo-, iodo-, ethynyl-,cyano-, azido-, aceytyl, aryl ketone, a photolabile group, a fluorescentgroup, and a heavy metal.
 16. The method of claim 14, wherein said aminoacid analog is a derivative of Phe.
 17. The method of claim 1, whereinsaid set of optimized protein sequences comprises the globally optimalprotein sequence.
 18. The method of claim 1, wherein said DEEcomputation is selected from the group consisting of original DEE andGoldstein DEE.
 19. The method of claim 1, wherein said analyzing stepincludes the use of at least one scoring function.
 20. The method ofclaim 19, wherein said scoring function is selected from the groupconsisting of a van der Waals potential scoring function, a hydrogenbond potential scoring function, an atomic solvation scoring function,an electrostatic scoring function and a secondary structure propensityscoring function.
 21. The method of claim 19, wherein said analyzingstep includes the use of at least two scoring functions.
 22. The methodof claim 19, wherein said analyzing step includes the use of at leastthree scoring functions.
 23. The method of claim 19, wherein saidanalyzing step includes the use of at least four scoring functions. 24.The method of claim 19, wherein said atomic solvation scoring functionincludes a scaling factor that compensates for over-counting.
 25. Themethod of claim 1, further comprising generating a rank ordered list ofadditional optimal sequences from said globally optimal proteinsequence.
 26. The method of claim 25, wherein said generating includesthe use of a Monte Carlo search.
 27. The method of claim 25, furthercomprising testing some or all of said protein sequences from saidordered list to produce potential energy test results.
 28. The method ofclaim 27, further comprising analyzing the correspondence between saidpotential energy test results and theoretical potential energy data. 29.The method of claim 1, wherein said amino acid analog can be buried inthe binding pocket of at least one of said set of optimized proteinsequences, or wherein said amino acid analog is completely or almostcompletely superimposable with the natural amino acid substrate in saidthree-dimensional structure.
 30. The method of claim 2, wherein in step(h), interactions between said amino acid analog and said variableresidue position(s) are scaled up by a factor of between 2.0 to 20.0,with an increment of 0.5, 1.5, or 2.0.